I find the way the authors of this post think about "AI agents" really interesting.
Most people think of agents like they think of human employees. They set up a limited number of agents to run in parallel (often just one), with each agent running in a loop and doing one task at a time. They're still in a world where you have a fixed (on the timescale of hours or days) number of employees, each employee can only do one thing at a time, and transferring tasks between employees is slow and costly.
LLMs don't really work like that. You effectively have an infinite number of agents that you can conjure out of thin air at any time. There's no cost advantage to performing LLM requests in series rather than in parallel.
If you realize this, the pattern of each agent fanning out and forking itself into as many sub-agents as are needed to fulfill the task becomes obvious. This is exactly what the authors have done.
I think a better way to think of agents is as "tasks" or "jobs", like those you might find in Celery or Sidekiq, and to apply the learnings from those.
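To make that "agents as jobs" framing concrete, here's a minimal sketch of what it could look like with Celery. Everything here is illustrative: the broker URL, the `run_agent` task, and the `call_llm` helper are my assumptions, not anything from the post.

```python
# Minimal sketch: each "agent" is just a queued job, not a standing employee.
# Assumes a local Redis broker; call_llm() is a hypothetical LLM client.
from celery import Celery, group

app = Celery("agents", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

def call_llm(prompt: str) -> str:
    """Placeholder for whichever LLM client you actually use."""
    raise NotImplementedError

@app.task
def run_agent(subtask: str) -> str:
    # One job per agent invocation; the worker pool decides how many run at once.
    return call_llm(f"Complete this subtask and return only the result:\n{subtask}")

def fan_out(subtasks: list[str]):
    # Conjure as many "agents" as there are subtasks and run them in parallel.
    return group(run_agent.s(t) for t in subtasks).apply_async()
```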
For fun last month I decided to see if I could build a fully functional business of agents. It's 175 Python files (employees) built up of internal roles within those files (tasks). So what I have is 175 employees who are able to pass output around to each other, understand the work, complete the work, and understand where to send the output. The whole system has the ability to do around 275 base processes (same as a business at >$100MM ARR). I started on a Friday afternoon, slept a little bit, and finished on Monday afternoon. After I had it running I sent it to a VC friend to show them, and they sent back the deck of a startup that is in stealth with $25MM doing it the exact same way. With 1 month and a designer and an engineer, I could have it MVP-functional for anyone to use ($40k?). Times are changing. Here is kinda how it looks: https://s.h4x.club/9ZuO4XQR / https://s.h4x.club/jkuB8ZED (I've evolved it a little since this, and if you're an engineer and look at my files and think, this guy is a moron: I know! :))
LLMs don't understand. It's mind-boggling to me that large parts of the tech industry think that.
Don't ascribe to them what they don't have. They are fantastic at faking understanding. Don't get me wrong, for many tasks, that's good enough. But there is a fundamental limit to what all this can do. Don't get fooled into believing there isn't.
> LLMs don't understand. It's mind-boggling to me that large parts of the tech industry think that.
I think you might be tied to a definition of "understanding" that doesn't really apply.
If you prompt an LLM with ambiguous instructions, it asks you to clarify (i.e., to extend the prompt and provide more context), and once you do, the LLM outputs something that exactly meets the goals of the initial prompt. Does that count as understanding?
If it walks like a duck and quacks like a duck, it's a duck, or something so close to a duck that we'd be better off calling it that.
Your question can be rephrased to “what would an actual difference look like.”
However, what you are asking underneath that is a mix of “what is the difference” and “what is the PRACTICAL difference in terms of output”.
Or in other words, if the output looks like what someone with understanding would say, how is it meaningfully different.
—-
Humans have a complex model of the world underlying their thinking. When I am explaining this to you, you are (hopefully) not just learning how to imitate my words. You are figuring out how to actually build a model of an LLM, that creates intuitions / predictions of its behavior.
In analogy terms, learning (understanding) from this conversation is like creating a bunch of LEGO blocks in your head, which you can then reuse and rebuild according to the rules of LEGO.
One of the intuitions is that humans can hallucinate: they can have a version of reality in their head which they know is accurate and predicts physical reality, but when they are sick or ill they can end up translating their sensory input as indicating a reality that doesn’t exist. Or they can lie.
Hallucinations are a good transition point to move back to LLMs, because LLMs cannot actually hallucinate, or lie. They are always “perceiving” their mathematical reality, and always faithfully producing outputs.
If we are to anthropomorphize it back to our starting point about “LLMs understand”, this means that even when LLMs “hallucinate” or “lie”, they are actually being faithful and honest, because they are not representing an alternate reality. They are actually precisely returning the values based on the previous values input into the system.
“LLMs understand” is misleading, and smuggles in a concept of truth (and therefore untruth) and other intuitions that are invalid.
—-
However, understanding this does not necessarily change how you use LLMs 90% of the time; it just changes how you model them in your head, resulting in a higher match between observed reality and your predictive reality.
For an LLM this makes no difference, because it's forecasting the next words the same way.
It depends on which human feedback was used to train the model. For humans, there are various communication models like the four-sides model. If the dataset has annotations for the specific facets of the communication model, then an LLM trained on this dataset will have specific probabilities that replicate that communication model. You may call this understanding what the prompter says, but it's just replication for me.
This isn’t a complete answer, but my short list for moving the tech many steps forward would be:
* replying with “I don’t know” a lot more often
* consistent responses based on the accessible corpus
* far fewer errors (hallucinations)
* being able to beat Pokémon reliably and in a decent time frame without any assistance or prior knowledge about the game or gaming in general (Gemini 2.5 Pro had too much help)
So you have two prompts, one is ambiguous and the second is the same prompt but with the ambiguity resolved.
In the first prompt the replicated pattern is to ask for clarification, in the second prompt the replicated pattern is to perform the work. The machine might understand nothing but does it matter when it responds appropriately to the different cases?
I don't really care whether it understands anything at all, I care that the machine behaves as though it did have understanding.
> If it walks like a duck and quacks like a duck, it's a duck, or something so close to a duck that we'd be better off calling it that.
Saying “LLMs match understanding well enough” is to make the same core error as saying “rote learning is good enough” in a conversation about understanding a subject.
The issue is that they can pass the test(s), but they don't understand the work.
This is the issue with a purely utilitarian measure of output.
Argumentum ad populum, I have the impression that most computer scientists, at least, do not find Searle's argument at all convincing. Too many people for whom GEB was a formative book.
> If you define a grammar for a new programming language and feed it to an LLM and give it NO EXAMPLES can it write code in your language?
Yes. If you give models that have a cutoff of 2024 the documentation for a programming language written in 2025 it is able to write code in that language.
In my experience it generally has a very good understanding and does generate the relevant test cases. Then again I don't give it a grammar, I just let it generalize from examples. In my defense I've tried out some very unconventional languages.
Grammars are an attempt at describing a language. A broken attempt if you ask me. Humans also don't like them.
For natural language you are right. The language came first, the grammar was retrofitted to try to find structure.
For formal languages, of which programming languages (and related ones like query languages, markup languages, etc.) are an instance, the grammar defines the language. It comes first, examples second.
Historically, computers were very good at formal languages. With LLMs we are entering a new age where machines are becoming terrible at something they once excelled at.
Have you lately tried asking Google whether it's 2025? The very first data keeping machines (clocks) were also pretty unreliable at that. Full circle I guess.
YES! Sometimes. You’ll often hear the term “zero-shot generation”, meaning creating something new given zero examples, this is something many modern models are capable of.
Well it depends. For example, calculators have been around for a while and a calculator that only performs better than the average human is not very useful. Sorting algorithms are another example.
Autocomplete/intellisense in an IDE is probably the most salient example. An autocomplete that performs _as well_ as the average programmer is, well, totally useless.
> Hacker news people still think LLMs are just some statistical model guessing things.
That's exactly what they are. It's the definition of what they are. If you are talking about something that is doing something else, then it's not an LLM.
No, it's not. Arrogant developers on HN like to parrot this, but it isn't true.
The power of LLMs is the emergent properties that weren't specifically taught to the model. That is not a property that exists in statistical, largely more deterministic models.
If you think it's just a giant token prediction machine you've just ignored the last 5 years.
I don't believe the user meant "understand" in the classical biological and philosophical sense, or was otherwise attempting to anthropomorphize the systems. They were speaking from the practical experience of "this thing takes a somewhat ambiguous input with unique constraints and implements the ask more-or-less as intended".
They understand. Anything able to reason about any arbitrary request and form a plan tailored to that request understands well enough to qualify for the verb. The mechanism behind it may feel hollow or fake. But if its responses reliably show understanding, the LLM understands - by any ordinary measure.
Nearly every argument like this has the same fatal flaw, and it's generally not the critique of the AI, but the critique reflected back on to humans.
Humans also don't understand and are frequently faking understanding, which for many tasks is good enough. There are fundamental limits to what humans can do.
The AI of a few months ago, before OpenAI's sycophancy, was quite impressive, less so now, which means it is being artificially stunted so more can be charged later. It means privately it is much better than what is public. I can't say it "understands," but I can say it outclasses many, many humans. There are already a number of tasks based around understanding where I would already choose an LLM over a human.
It's worth looking at bloom's taxonomy (https://en.wikipedia.org/wiki/Bloom%27s_taxonomy): In the 2001 revised edition of Bloom's taxonomy, the levels were renamed and reordered: Remember, Understand, Apply, Analyze, Evaluate, and Create. In my opinion it is at least human competitive for everything but create.
I used to be very bearish on AI, but if you haven't had a "wow" moment when using one, then I don't think you've tried to explore what it can do or tested its limits with your own special expertise/domain knowledge, or if you have, then I'm not sure we're using the same LLMs. Then compare that experience to normal people, not your peer groups. Compare an LLM to people into astrology, crystal healing, or homeopathy and ask which has more "understanding."
I do agree with you - but the big difference is that humans-who-are-faking-it tend to learn as they go so might, with a bit of effort, be expected to understand eventually.
Does that actually matter? Probably not for many everyday tasks...
The counter was, nope, they don't. They can fake it well though.
Your argument now is, well humans also often fake it. Kinda implying that it means it's ok to claim that LLMs have understanding?
They may outclass people in a bunch of things. That's great! My pocket calculator 20 years ago also did, and that's also great. Neither understands what it is doing though.
It's fun to talk about, but personally I think the whole "understanding" debate is a red herring. IMO what we actually care about when we talk about intelligence is the capacity and depth of second-order thinking, regardless of the underlying mechanism. The key question isn't "do LLMs understand?" but "can LLMs engage in second-order thinking?" The answer seems to be yes: they can reason about reasoning, plan their approaches, critique their own outputs, and adapt their strategies. o1 has shown us that with RL and reasoning tokens you can include this in a single system, but our brains have multiple systems we can control and combine in multiple ways at any given moment: emotions, feelings, and thoughts combined into user space, with three core systems of input, memory, and output. The nuance is that, for various reasons (nature + nurture), different humans appear to have varying levels of meta-control over these multiple reasoning systems.
Why are you pretending to be participating in a debate? You mention things like "moving the goalpost", "counter[arguments]", and "arguments", as if you did anything more than just assert your opinion in the first place.
This is what you wrote:
> LLMs don't understand.
That's it. An assertion of opinion with nothing else included. I understand it sucks when people feel otherwise, but that's just kinda how this goes. And before you bring up how there were more sentences in your comment, I'd say they are squarely irrelevant, but sure, let's review those too:
> It's mind-boggling to me that large parts of the tech industry think that.
This is just a personal reporting of your own feelings. Zero argumentational value.
> Don't ascribe to them what they don't have.
A call for action, combined with the same assertion of opinion as before, just rehashed. Again, zero argumentational value.
> They are fantastic at faking understanding.
Opinion, loaded with the previous assertion of opinion. No value add.
> Don't get me wrong, for many tasks, that's good enough.
More opinion. Still no arguments or verifiable facts presented or referenced. Also a call for action.
> But there is a fundamental limit to what all this can do.
Opinion, and a vague one at that. Still nothing.
> Don't get fooled into believing there isn't.
Call for action + assertion of opinion again. Nope, still nothing.
It's pretty much the type of comment I wish would just get magically filtered out before it ever reached me. Zero substance, maximum emotion, and plenty of opportunities for people to misread your opinions as anything more than that.
Even within your own system of opinions, you provide zero additional clarification why you think what you think. There's literally nothing to counter, as strictly speaking you never actually ended up claiming anything. You just asserted your opinion, in its lonesome.
This is no way to discuss anything, let alone something you or others likely feel strongly about. I've had more engaging, higher quality, and generally more fruitful debates with the models you say don't understand, than anyone here so far could have possibly had with you. Please reconsider.
> higher quality, and generally more fruitful debates with the models you say don't understand
My favorite thing about LLMs is that they can convincingly tell me why I'm wrong or how I could think about things differently, not for ideas on the order of sentences and paragraphs, but on the order of pages.
My second favorite thing is that it is amazingly good at deconstructing manipulative language and power tactics. It is scary good at developing manipulation strategies and inferring believable processes to achieve complex goals.
Had some success with that myself as well. Also found out about Claimify [0] recently, I should really get myself together and get a browser extension going one of these days. I think the quantized gemma3 models should be good enough for this, so it could remain all local too.
> So, it is your opinion that the mere expression of opinion "without anything else" is not allowed in a discussion?
This is not what I said, no: I said that asserting your opinion over others' and then suddenly pretending to be in a debate is "not allowed" (read: is no way to have a proper discussion).
A mere expression of opinion would have been like this:
> [I believe] LLMs don't understand.
And sure, having to stick an explicit "I think / I believe" everywhere is annoying. But it became necessary when everything else you had to say continued to omit this magic phrase, and its absence became clearly intentional when you started talking as if you had made any arguments of your own. Merely expressing your opinion is not what you did, even when reading it charitably. That's my problem.
> Would your own contribution to the discussion pass your own test?
And so yes, I believe it does.
> You might have overlooked that I provided extensive arguments all around in this thread. Please reconsider.
I did consider this. It cannot be established that the person whose comment you took a whole lot of issue with also considered those though, so why would I do so? And so, I didn't, and will not either. Should I change my mind, you'll see me in those subthreads later.
meh. I feel this is just a linguistic shortcut, similar to how _trained_ biologists can talk about a species or organism evolving some trait. Of course the organism isn't _really_ evolving with any goal in mind, but that's clear to the speaker and audience. Whether or not LLMs understand (very unlikely), it's clear what we mean by an LLM "understanding": has the context + prior training to make reasonable predictions. But no one wants to write that each time.
Exactly. The whole point of all the LLM companies is to get grandma to use it. If you say understand about a technology with the desired appeal of Facebook, then you’re talking to everyone and words matter extra hard.
A person who memorizes something by rote can pass many tests. From a test and verifiability perspective, they cannot be distinguished from someone who understands the subject.
An LLM can pass many tests; it is indistinguishable from someone who understands the subject.
Indistinguishable does not imply that the processes followed match what a human is doing when it understands a subject.
I use this when I think of humans learning - humans learn the most when they are playing. They try new things, explore ideas and build a mental model of what they are playing with.
To understand something is to have a mental model of that thing in one's head.
LLMs have models of symbol frequency, and with their compute, are able to pass most tests, simply because they are able to produce chains of symbols that build on each other.
However, similar to rote learning, they are able to pass tests, not understand. The war is over the utilitarian point “LLMs are capable of passing most tests” and the factual point “LLMs don't actually understand anything”.
This articulation of the utilitarian point is better than the lazier version that simply says “LLMs understand”, which ends up anthropomorphizing a tool and creating incorrect intuitions of how LLMs work among other citizens and users.
The extraordinary claim would be that LLMs can only do things they've seen before exactly, given the compositional and emergent capabilities we observe. The evidence suggests they can generalize beyond their training in meaningful ways, even if imperfectly. If a human came out living but with a brain that had zero electrical activity, that would be extraordinary, we normally come out with a baseline of pre-programming. I sometimes think this debate happens because humans don't want to admit we're nothing more than LLMs programmed by nature and nurture; humans seem to want to be especially special.
>if a human came out living but with a brain that had zero electrical activity, that would be extraordinary, we normally come out with a baseline of pre-programming.
That statement reveals deep deficiencies in your understanding of biological neural networks. "electrical activity" is very different from "pre-programming". Synapses fire all the time, no matter if meaningfully pre-programmed or not. In fact, electrical activity decreases over time in a human brain. So, if anything, programming over time reduces electrical activity (though there is no established causal link).
> I sometimes think this debate happens because humans don't want to admit we're nothing more than LLMs programmed by nature and nurture; humans seem to want to be especially special.
It's not specific to humans. But indeed, we don't fully understand how the brains of humans, apes, pigs, cats and other animals really work. We have some idea of synapses, but a lot is still unclear. It's like thinking that just because an internal combustion engine is made of atoms, and we mostly know how atomic physics and chemistry work, anybody with this basic knowledge of atomic physics can understand and even build an ICE. Good luck trying. It's similar with a brain. Yes, synapses play a role. But that doesn't mean a brain is "nothing more than an LLM".
Neural activity begins around 6 weeks' gestation; electrical patterns help establish basic neural circuits; activity-dependent neural development shapes connectivity before any sensory input; there are critical periods where electrical activity literally sculpts brain architecture. Motor patterns get programmed before birth (why babies can suck, grasp, etc.), language processing areas develop structural biases before hearing language, the visual cortex develops orientation maps before seeing anything, and basic learning algorithms get "wired in" through developmental processes. If a human emerged, was able to function in the world, do things, but had zero electrical activity in the brain, that would be... normal? No: extraordinary.
Humans arrive out of the VJJ with innate neural architectures to be filled and developed - not literal blank slates, there is an OS. The electrical activity during development is literally the biological process that creates our "base programming." LLMs have architectural inductive biases (attention mechanisms, etc.), human brains have evolved architectural biases established through fetal development. We're both "pre-programmed" systems, just through different mechanisms.
Your response about "electrical activity decreases over time" is irrelevant - you weren't talking about adult brain activity, you were talking about the developmental process that creates our initial neural architecture.
tbh: I can't tell if you're engaging in good faith or not.
The definition of understanding is based on connecting relations. If there is one thing an LLM can do, it's connecting relations. So I am not sure why you say LLMs are not understanding.
They understand tho; it's different from how it's done in our brain, but they solve tasks that would be impossible to do without understanding. I would even say that they can now reason through problems, thanks to powerful reasoning models like Gemini 2.5 Pro and o3.
You are asking why it is meaningful to use terms for what they mean instead of making up things?
Well, I prefer it that way, but the spirit of "AI" seems to go in another direction, and the leadership of US government also does, so maybe times are just changing.
Thanks!!! I decided not to build it, that space is already too busy; there is a startup with $25MM in stealth, and who else is in stealth? On top of that, this method will get stale very very quickly. Foundation model businesses are just too hard to work around right now; it's a silly way to do business. My magic is that I've built a startup from scratch to over 400 people and watched what they do, and it won't be long till that isn't worth much.
I’ve been floating around a similar set of ideas and it’s been very fun (if not all that useful yet) to build
Did you try taking it one step further where a “recruiter” has to hire the engineers after a screening process? I wonder if this could get you even better AI engineers
Sounds really interesting, but I have no idea what the purpose of having 175 "employees" here is. Maybe it is a smart way to sell the idea that you're going to replace 175 people if you buy the product? You could just buy ChatGPT instead, I guess, but a chatbot doesn't sound as cool as 175 employees.
Goods and services are a byproduct of business; business is primarily concerned with systems and processes that facilitate value exchange. So my tool can work with a user to build a business, not a product or a service. If you bake cupcakes, my tool can get you 100 people at your door; it cannot open the door or provide the cakes.
The only investor is me. I built it on my own over a weekend. I just wanted to confirm it can be done and therefore will exist, that is all. Personally, I decided not to pursue it because I am old and lazy and don't want to compete against a16z- and Sequoia-funded, Adderall-filled teenagers.
It professes to be able to do the business side of business (not literal product or technology development). They did not have any agents to code or design; mine didn't either, but it has agents that can call tools. I don't think mine can build a product, but I believe mine can build, operate, and grow a business around one, and I presume theirs will be able to also.
Your message doesn't make it clear what those 175 employees can realistically accomplish on their own.
For instance, you might have an SEO expert on the team, but that alone won't guarantee top search engine rankings. There are countless SEO professionals and tools (human or AI-powered), and even having the best one doesn't eliminate the underlying challenge: business competition. LLMs, like any other tool, don’t solve that fundamental problem.
No employees accomplish anything on their own in the real world; all employees are part of a team. That's why I designed a business strategy and analysis layer (over half the system, in fact), with web tools and connections to all of the insights systems (like Mixpanel). I built the exact same thing I built at DigitalOcean, but instead of humans I defined them with code. DigitalOcean runs just fine, and so does my LLM system. The whole system I built is self-learning, insight-gathering and refining. Competition is for losers; the best teams win via the best insights.
Why 175? Why not 5 billion employees? Why not 20000 companies in parallel? Why not simulate 5 earth's worth of history and setup a full universe of worlds full of startups?
This sounds like those guys in social media that one up each other with their bed times and end up saying they wake up every day at 2am to meditate and work out
Because that was the scope of the project. When we got to 400 employees at DigitalOcean I noticed I thought it was really half that. Originally I just set out to make the marketing and strategy team, but got a bit carried away. The FP&A team was the only group I really struggled with; my CFO skills are very meh.
1 single agent with a good model is going to beat that approach every single time. The same way WhatsApp needed only 55 people (and probably the last hires were not needed for the outcome) to sell for $19B.
And other companies have existed for hundreds of years and had thousands of people work for them and never even made $100M.
I'm confused about what you're saying. There are loads of markets, loads of segments, loads of ways to do unit economics, yes, but business is business; it's prescriptive at its core. I'm using a single model, it's just OpenAI calls using the role function.
This really sounds like a “faster horse” scenario and totally misses the point of the GP's comment: why shackle yourself to modeling the way humans work?
> forking itself into as many sub-agents as are needed to fulfill the task
The forking is free. Running the sub-agents is a linear cost, but the expensive bit is joining the agents' responses back together again.
If a task has 6 subtasks and an agent is spawned for each, at some point some 'joiner' agent needs to parse and summarize the findings of the sub-agents and feed them back to the parent. That step necessarily involves information loss, and uses computation that a single linear agent design would not.
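A rough sketch of that fan-out/join shape, to make the cost explicit (the `call_llm` coroutine is hypothetical, and error handling is omitted): the join step is where the summarization, and therefore the information loss, happens.

```python
# Sketch of the fan-out / join pattern discussed above.
import asyncio

async def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your actual async LLM client here

async def solve(task: str, subtasks: list[str]) -> str:
    # Fan out: one sub-agent per subtask, all running concurrently.
    findings = await asyncio.gather(
        *(call_llm(f"Subtask of '{task}': {s}") for s in subtasks)
    )
    # Join: a 'joiner' agent compresses the findings for the parent.
    # This summarization step is where information is necessarily lost.
    return await call_llm(
        f"Task: {task}\nSummarize and reconcile these findings:\n"
        + "\n---\n".join(findings)
    )

# Example: asyncio.run(solve("write the report", ["gather data", "draft outline"]))
```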
I designed something for a business and found I needed 4 major sub-systems (like a real business) - insight/data, cognition, meta cognition and execution, and if you don't define all 4, the system is junk.
> I designed something for a business and found I needed 4 major sub-systems (like a real business) - insight/data, cognition, meta cognition and execution, and if you don't define all 4, the system is junk.
Might it be just another realization of Conway's law?
Might it be possible that the only reason you're assuming a system is junk is just that it doesn't resemble the systems you know and expect? There are so many ways to skin a cat, and certainly no business process represents the optimal process.
I thought about this for 15 hours. First, I really appreciate how you wrote this comment. It's an extremely well composed comment. I also appreciate your use of the word might. Anyway, I suspect you are probably correct, and I hit the wall again because: I'm just too versed.
> If you realize this, the pattern of each agent fanning out and forking itself into as many sub-agents as are needed to fulfill the task becomes obvious.
And this is precisely how really bad things could happen:
The challenge with fan-out is constructing a linear conversation that makes sense and captures previous history. In any context where the LLM needs that information, linear loops often perform better than trying to splice together conversations from multiple parallel processes.
This is similar to something we've been doing for a while. Instead of individual agents we are creating many iterations and sub-iterations of spawned agents that are largely autonomous. A lot of the human-centric paradigms just don't really apply to LLMs/AI but people are used to approaching them that way.
"FP32 is less common in modern ML workloads and often less optimized on recent hardware compared to FP16 or BF16, which may partly explain why it’s easier to achieve performance gains over PyTorch with FP32 kernels."
People haven't spent time optimizing the fp32 versions of these kernels in years. This will be much more interesting if they can improve the kernels where developer effort has gone and that are actually used.
I believe that these good results are explained at least in part by the fact that NVIDIA does not provide detailed enough documentation for their GPUs.
For a processor with well-documented microarchitecture, for which a programmer or a compiler can deterministically write an optimal program, it is much less likely that applying ML/AI can be successful, except as a substitute for searching already known solutions.
On the other hand, for less documented microarchitectures, like those of the NVIDIA GPUs, finding an optimal program may be impossible other than by doing a random search guided by examples of previous optimized programs, and possibly doing some reverse-engineering work to determine the real behavior of the GPU in some circumstances.
Improving over something like this is likely to be feasible for ML/AI, where training over known good programs may be able to extract some of the undocumented behavior that may be non-obvious for humans reading those examples.
> For a processor with well-documented microarchitecture, for which a programmer or a compiler can deterministically write an optimal program
We don't even know the optimal algorithms! AlphaEvolve recently found "an algorithm to multiply 4x4 complex-valued matrices using 48 scalar multiplications, improving upon Strassen’s 1969 algorithm that was previously known as the best in this setting." - https://www.nature.com/articles/s41586-022-05172-4
I was not referring to algorithm design, but to algorithm implementation on a given processor, which consists of things like register allocation, machine instruction selection and static instruction scheduling.
> For a processor with well-documented microarchitecture, for which a programmer or a compiler can deterministically write an optimal program
You severely underestimate the landscape of possible implementations for these kernels. There are many ways of performing a matrix multiplication and predicting which one will perform best without running them all is nontrivial, even with perfect knowledge of the underlying system.
This is just a completely incorrect take, speaking as a former insider.
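As a toy illustration of that landscape (nothing to do with the kernels in the post): even for a plain blocked matrix multiply on a CPU, the best block size depends on the machine, which is why autotuners measure candidates instead of predicting the winner from a spec sheet. Real GPU kernel search spaces (tilings, vector widths, pipelining, tensor-core layouts) are vastly larger.

```python
# Toy example: the fastest blocked-matmul variant depends on the machine,
# so you time candidates rather than predict the winner in advance.
import time
import numpy as np

def blocked_matmul(a, b, block):
    n = a.shape[0]
    c = np.zeros_like(a)
    for i in range(0, n, block):
        for k in range(0, n, block):
            for j in range(0, n, block):
                c[i:i+block, j:j+block] += a[i:i+block, k:k+block] @ b[k:k+block, j:j+block]
    return c

n = 1024
a, b = np.random.rand(n, n), np.random.rand(n, n)
for block in (64, 128, 256, 512):
    t0 = time.perf_counter()
    blocked_matmul(a, b, block)
    print(f"block={block}: {time.perf_counter() - t0:.3f}s")
```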
If you cannot predict the running time of an algorithm without running it on the target processor, then by definition the documentation of that processor is incomplete.
For a processor that is completely documented one must be able to run a simulation model that provides the running time for a given program, when the processor is so complex that simpler methods for computing the execution time do not work.
For older NVIDIA GPUs, there exist such simulation models, but they are only partially accurate, because they are based on reverse engineering, without cooperation from the GPU vendor.
The point being made is that in a production environment you can't run/simulate all the possible candidate implementations to find out the fastest one -- it would take far longer than just choosing one at random. Therefore, you need an algorithmic way of picking a good candidate out of the many you have, and you can't take forever to make that selection either, because the clock is ticking the moment you receive a request to run that matrix multiply.
You can't precompute all the possible options in advance and fetch the running time from a database either, because the parameter space is just way too huge.
Notice that none of this has anything to do with having accurate models of the system. This is what people who do this for a living and have perfect knowledge of the system choose to do, for good reasons.
Nobody writing high-performance code for these machines has that documentation. They largely do ok anyway because the time of cycle counting is long in the past and the name of the game involves cache effects and synchronization that is very hard to reason about individually but clearly visible if taken in aggregate. You don't get a cookie if you can accurately time 10 instructions, but you do if your matrix multiply over a hundred million of them is 1% faster.
While it is decidable, people typically never produce optimal programs even for the hot path. It is just intractable and too slow to do right now.
For register allocation and instruction selection, there is hope because they are fixed-parameter tractable (FPT) and there are algorithms to do them optimally in polynomial time, albeit with a large constant factor, making them impractical to apply in compilers as of today. For instruction scheduling, it is just too hard. If you read the literature on scheduling algorithms, it is NP-hard for apparently simple instances, e.g., 2 parallel identical machines with no preemption and bounding completion time (https://www2.informatik.uni-osnabrueck.de/knust/class/), while actual microarchitectures are much more complicated than this...
Needless to say, these are already the simpler problems. The longer the program or the more profiling data you can optimize for, the more tricks you can throw at it, and most of them are NP-hard to optimize optimally.
Being NP-hard doesn't imply that you can't obtain the optimal result, but compilers that I know of do not implement them, because most users are not willing to wait for days for such a compilation to complete. Ideally, one should make something that can run on clusters of CPUs or GPUs to optimize this, and people having those clusters will typically be willing to do this because they want to optimize the program they later run on the clusters. However, to my knowledge, no one is working on this at the moment.
While you are right that in general it may be very difficult or quasi-impossible to generate optimal programs, there are also plenty of cases where an optimal program is achievable.
The latter happens when there is one dominant bottleneck for the algorithm, which is determined by the hardware, e.g. the maximum throughput of a certain instruction, like multiplication or memory load. When the implemented program reaches a throughput almost equal to that absolute limit, then one can be certain of its optimality.
Matrix multiplies are typically compute bound, but you don't get much option to improve the actual algorithm because Nvidia gives you an accelerator for one and anything else would be slower.
While I think the OP did not mean the compilation process is nondeterministic, I wouldn't be surprised if it actually is.
A lot of algorithms and data structures rely on nondeterminism for performance or security (by default). It is too easy to introduce nondeterminism accidentally, and it is tempting to use that to speed up algorithms.
Also, if it relies on floating point, results on different machines and environments may be different (depending on libm and hardware implementation), which is, in some sense, nondeterministic.
> A lot of algorithms and data structures rely on nondeterminism for performance or security (by default). It is too easy to introduce nondeterminism accidentally.
You don't know what you're talking about - no compiler engineer in their right mind would intentionally use a randomized algorithm in a compiler. It's a bug every single time and it gets squashed immediately.
The running time of a CUDA kernel is apparently impossible to determine except by experiment and measurement, and might be nondeterministic. By contrast for a more typical CPU, there's a compiler whose assembly output you can examine, and there's a processor manual that gives the cycle timing of each instruction. So you can compute the running time at least of inner loops that stay in cache, and that sort of thing.
> By contrast for a more typical CPU, there's a compiler whose assembly output you can examine, and there's a processor manual that gives the cycle timing of each instruction
1. You can dump the SASS that corresponds to PTX: `cuobjdump --dump-sass <input_file>`
2. Getting the cycle count of a single instruction for an OOO architecture is completely meaningless because you have no idea when the instruction will actually be issued. This is true for both AMDGPU and NV.
No one said the AI created new algorithms nor that there weren’t pre-existing solutions.
The implication was that the FP32 versions of these kernels have lagged behind the more popular versions. There was opportunity to translate the advancements from other kernels into these. Someone would need to look closely to see exactly what was done, but it’s premature to suggest anything like “new algos” or “no pre-existing solutions”
This is a great use case for LLMs, though. I often do something similar where I make improvements to something I use most frequently and ask an LLM to translate that pattern to other similar parts of the code.
> Does that mean optimized FP32 versions of these kernels were already there or not?
If you're trying to support your original point with that argument, then you're using some pretty awful definitions of the terms "new algos" and "no pre-existing solutions".
> Help me understand this 'cause I'm a bit slow these days ...
If I do `sed 's/f32/f16/g' kernel.cu` does this count as AI? Help me understand because I'm a little slow when it comes to all the dumb shit people attribute to LLMs these days...
The solution not existing in PyTorch does not mean the solution doesn’t exist elsewhere on the internet. Remember - PyTorch is largely maintained by employees of companies that have their own priorities for the SW and those priorities may not include hyper optimizing fp32 kernels.
That being said, it is cool if AI is enabling lower cost adoption of better more optimized kernels with less effort.
Read the article before spouting lies. Actually never mind that.
Read the damn comment you're responding to. There have been human written kernels for both fp16 and fp32 for a long time.
Here is the corrected version of your comment:
"Wow, so, you're basically saying the AI created the same but faster algos in a well known domain with established pre-existing solutions, whose overall impact on the runtime of practical workloads is insignificant? Awesome!"
My takeaway - from this article, from Google’s AlphaEvolve [1], and the recent announcement about o3 finding a zero day in the Linux kernel [2] - is that Gemini Pro 2.5 and o3 in particular have reached a new level of capability where these ideas that were tried unsuccessfully with other models, suddenly just work.
In my opinion, I wouldn't say so much that they are suddenly working. Rather, we've reached a point where they can iterate and test significantly faster than humans are capable of doing, and they have the ability to call on significantly more immediately available information that they can make sense of. As a result, the combination of information, advancement and intelligently applied brute force seems to be having success in certain applications.
Good points. I suspect that o3 is able to reason more deeply about different paths through a codebase than earlier models, though, which might make it better at this kind of work in particular.
I was blown away by some debugging results I got from o3 early on and have been using it heavily since. The early results that caught my attention were from a couple cases where it tracked down some problematic cause through several indirect layers of effects in a way where you'd typically be tediously tracing step-by-step through a debugger. I think whatever's behind this capability has some overlap with really solid work it'll do in abstract system design, particularly in having it think through distant implications of design choices.
The main trick is in how you build up its context for the problem. What I do is think of it like a colleague I'm trying to explain the bug to: the overall structure is conversational, but I interleave both relevant source chunks and detailed/complete observational info about the anomalous program behavior I've observed. I typically send a first message building up context about the program/source, and then build up the narrative context for the particular bug in a second message. This sets it up with basically perfect context to infer the problem, and sets you up for easy reuse: you can back up, clear that second message, and ask something else, reusing the detailed program context given by the first message (see the sketch below).
Using it on the architectural side you can follow a similar procedure but instead of describing a bug you're describing architectural revisions you've gone through, what your experience with each was, what your objectives with a potential refactor are, where your thinking's at as far as candidate reformulations, and so on. Then finish with a question that doesn't overly constrain the model; you might retry from that conversation/context point with a few variants, e.g.: "what are your thoughts on all this?" or "can you think of better primitives to express the system through?"
I think there are two key points to doing this effectively:
1) Give it full, detailed context with nothing superfluous, and express it within the narrative of your real world situation.
2) Be careful not to "over-prescribe" what it says back to you. They are very "genie-like" where it'll often give exactly what you ask for in a rather literal sense, in incredibly dumb-seeming ways if you're not careful.
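A rough sketch of that two-message structure, in case it helps (the model name, client usage, and all the placeholder content here are my assumptions, not the commenter's exact setup):

```python
# Sketch of the "program context first, bug narrative second" setup.
# Model name and client are assumptions; swap in whatever you actually use.
from openai import OpenAI

client = OpenAI()

program_context = (
    "Here's the system I'm working on: <one-paragraph overview>.\n"
    "Relevant source, each chunk prefixed with its file path:\n"
    "<interleaved source chunks and notes on how they fit together>"
)

bug_narrative = (
    "Now the bug: <what you observed, logs, timings, what you've ruled out>.\n"
    "Walk me through what could cause this."
)

messages = [
    {"role": "user", "content": program_context},  # reusable program context
    {"role": "user", "content": bug_narrative},    # swap this out per question
]

resp = client.chat.completions.create(model="o3", messages=messages)
print(resp.choices[0].message.content)

# To reuse the context, keep messages[0] and replace messages[1] with a new
# narrative (a different bug, or an architectural question as described above).
```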
In the context of LLMs, what do you mean by "reason"? What does reasoning look like in LLMs and how do you recognize it, and more importantly, how do you invoke it? I haven't had much success in getting LLMs to solve, well, basically any problem that involves logic.
Chain of thought at least introduces some skepticism, but that's not exactly reasoning. It makes me wonder what people refer to when they say "reason".
As best as I have understood it, the LLM's output is directly related to the state of the network as a result of the context. Thinking is the way we use intermediate predictions to help steer the network toward what is expected to be a better result through learned patterns. Reasoning consists of strategies for shaping that process to produce even more accurate output, generally having a cumulative effect on the accuracy of predictions.
It doesn’t? Reasoning is not an analysis; it is the application of learned patterns for a given set of parameters that results in higher accuracy.
Permit my likely inaccurate illustration:
You’re pretty sure 2 + 2 is 4, but there are several questions you could ask: are any of the numbers negative, are they decimals, were any numbers left out? Most of those questions are things you’ve learned to ask automatically, without thinking about it, because you know they’re important. But because the answer matters, you check your work by writing out the equation. Then, maybe you verify it with more math; 4 ÷ 2 = 2. Now you’re more confident the answer is right.
An LLM doesn’t understand math per se. If you type “2 + 2 =”, the model isn’t doing math… it’s predicting that “4” is the next most likely token based on patterns in its training data.
“Thinking” in an LLM is like the model shifting mode and it starts generating a list of question-and-answer pairs. These are again the next most likely tokens based on the whole context so far. “Reasoning” is above that: a controlling pattern that steers those question-and-answer sequences, injecting logic to help guide the model toward a hopefully more correct next token.
Very likely. Larger context is significantly beneficial to the LLMs when they can maintain attention, which was part of my point. Imagine being able to hold the word for word text of your required reading book while you are taking a test, while older models were more like a couple chapters worth of text. Two years ago.
It’s true that there are similarities between what you mentioned and what’s happening in this case. From the article:
> The result is a test-time loop that looks less like “chat with a compiler” in the case of sequential revision, and more like structured exploratory search, guided by explicit optimization hypotheses and aggressively parallel evaluation.
My conclusion would be that we’ve now learned to apply LLMs’ capabilities to shrink solution space where we have a clear evaluation function as well as solutions to problems that might follow similar patterns. This applies in this case as well.
IMO, It’s not about model X gaining on other models or model Y being able to reason about the solutions, etc. in a way that other models couldn’t.
Gemini Pro 2.5 is the first AI that I can productively use for anything other than human language translation, but it's just barely crossed that threshold. Sometimes I get success hit rates below 20%.
When 3.0 comes out, that... that's going to start getting a little scary.
SRE / DevOps / coding mostly in the Azure and .NET ecosystems.
The problems I have to solve tend to be the horrible ones that nobody has answers to, anywhere on the Internet, so unsurprisingly the AIs aren't good at it either.
The trick has been to use the AIs for what they are good that, which used to be "nothing" for me at least, but now I can use them productively for certain "spot" tasks.
Random examples:
- Cross-language and cross-platform benchmarking of a bunch of different database clients to see how they stack up. I gave the AI a working example in one language and got it to whip up a series of equivalents with other DB drivers and languages. Sure, it's trivial, but it's way faster than doing it myself!
- Crash dump analysis using WinDbg. I read somewhere that "vibe debugging" of kernel dumps totally works, so when I had an actual crash I gave it a go for laughs. With AI help I managed to extract the name of the specific file that had NTFS corruption and was crashing the server. Deleted the file, restored it from backups, and the server was good to go again!
- If you ever watch the top mechanical engineers on YouTube, they all make their own tools instead of just buying them. Jigs, extenders, unusual sizes, etc... IT work is the same. As a recent example, I got Gemini to make me a code-AST rewriter for a specific issue I wanted to clean up in bulk across a huge code base. Using the Roslyn compiler SDK is a bit fiddly, but it spat out a working tool for me in under an hour. (This is not something you can solve with a script full of regex, it needed a proper parser to handle commented-out blocks and the like.)
> Sure, it's trivial, but it's way faster than doing it myself
That's the clincher for me. So much software work is just executing on a design, not inventing anything new. Being able to do 5x the trivial work in an hour is life changing, and it lets me pull my head out of that work to see how I can make larger process improvements. AI doesn't need to rewrite the Linux kernel in Rust to be extremely valuable to the average developer.
Sounds like interesting work, thanks for sharing! "Vibe debugging", hah, I like that one. The latest crop of models is definitely unlocking new capabilities, and I totally get the desire to make your own tools. I do that to a fault sometimes, but it's nice to have a simple tool that does exactly one thing, exactly the way you want it.
I've been pair programming with the models for a while, and wrote some "agents" before I knew to call it that back in the dark days of GPT-3.5, but only recently with the latest models unlocking capabilities beyond what I could achieve with handwritten code.
Wait, what are you saying? These have nothing to do with the Linux kernel whatsoever, they are "kernels" in the GPU programming sense. Did you just hallucinate this whole comment or what?
There are zero-days in obscure parts of the kernel nobody uses every other day. (It also of course found 100 other things that were not zero-days or vulnerabilities yet professed they were, which is why this trash, even on Gemini 9000 Pro, keeps spamming security mails.)
There was a post on HN a bit ago from someone who used o3 to find a vulnerability in the Linux kernel's SMB server, which this person is just saying should've been tried earlier and probably recently became possible
yeah, it seems likely the underlying task here (one reasoning step away) was: replace as many fp32 operations as possible in this kernel with fp16. i'm not sure exactly how challenging a port like that is, but intuitively seems a bit less impressive
maybe this intuition is wrong but would be great for the work to address it explicitly if so!
Why do you think it is a huge tolerance ? (Just curious since it is not clear to me if that will lead to too much of reduction in numerical accuracy compared to the speedup)
The point is, this amount of error is huge for fp32, but may be expected for fp16. But then why compare to fp32 performance baselines? An algorithm that gives you the accuracy of fp16 should be compared to an fp16 baseline, and this may not be (it probably is not) a speedup at all, it's likely much slower.
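For a sense of the scale involved (a quick check anyone can run; this isn't the error figure from the post): float16 carries roughly three decimal digits of precision versus float32's roughly seven, so errors around 1e-3 are ordinary for fp16 arithmetic but enormous for something presented as an fp32 kernel.

```python
# Rough comparison of fp16 vs fp32 precision.
import numpy as np

print(np.finfo(np.float16).eps)  # ~9.8e-04 relative rounding error per operation
print(np.finfo(np.float32).eps)  # ~1.2e-07, i.e. several thousand times finer

# Accumulation widens the gap: sum 10k values against a float64 reference.
x = np.random.rand(10_000)
ref = x.sum()
err16 = abs(float(x.astype(np.float16).sum(dtype=np.float16)) - ref) / ref
err32 = abs(float(x.astype(np.float32).sum(dtype=np.float32)) - ref) / ref
print(f"fp16 relative error: {err16:.1e}, fp32 relative error: {err32:.1e}")
```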
This means the results are useless. Did they even check the relative error at all?
Replacing float32 operations with float16 is also pointless. There is nothing to be gained by doing this, as it removes the actual accuracy advantage of float32, which would be the single most important reason to use that version of the algorithm.
I think this error is large enough that referring to it as FP32 is misleading.
Also, the performance gains do not translate to my RTX 3060M GPU (3.8 GFLOPS vs PyTorch's 5.3), presumably because it lacks the optimized hardware for half precision.
But on the plus side, the single file was very easy to adapt and the code is quite readable. I have seen much uglier kernels.
By far the most interesting part (after the 400% speed up in some cases) is the methodology: rather than hill climb on operations, they forced a language reasoning step between iterations to encourage diversity of search. This seems to have worked. Very very interesting.
Just anecdotally I feel like hill climbing on operations is just so slow; I’m not saying it doesn’t work, but it always feels one step away from brute force search. I really like the idea of just throwing stuff at the LLM and giving it access to old strong variants in context.
Tried a replication here. The LayerNorm kernel is not numerically stable so cannot be counted as valid. They only test with zero mean and unit std, so the catastrophic cancellation doesn't show up until after.
EDIT: looks like they've since generated another one that is numerically stable! great work
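For anyone wondering what the instability looks like: a LayerNorm that computes variance as E[x^2] - E[x]^2 in a single pass cancels catastrophically once the mean is large relative to the spread, and zero-mean unit-std test inputs never exercise that regime. A small sketch of the underlying numerics (not the kernel from the post):

```python
# Why zero-mean/unit-std tests can hide an unstable LayerNorm variance.
import numpy as np

def naive_var(x):   # one-pass form E[x^2] - E[x]^2: numerically unstable
    return (x * x).mean() - x.mean() ** 2

def stable_var(x):  # two-pass form: subtract the mean first
    return ((x - x.mean()) ** 2).mean()

rng = np.random.default_rng(0)
base = rng.standard_normal(4096).astype(np.float32)

for shift in (0.0, 1e2, 1e4):
    x = (base + shift).astype(np.float32)
    print(f"mean ~ {shift:>7}: naive={naive_var(x):.6f}  stable={stable_var(x):.6f}")
# With shift=0 both agree (~1.0); with a large shift the naive form loses most
# of its accuracy in float32 and can even go negative.
```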
Beating pytorch and tensorflow kernels has been easy to do with ml compilers since ~2018. You typically train and evaluate your model in one of these frameworks then hand off the computation graph to a compiler like Apache TVM or your hardware vendor’s proprietary one. They should test their kernels against those kernels.
ML guided heuristic search over compute schedules is as old as 2013 (Halide for image processing)
> They are performing close to or in some cases even beating the standard expert-optimized production kernels shipped in PyTorch.
The PyTorch code base is NOT written by performance experts in any way. This is the wrong baseline. Nothing about that code base is clean or hand-optimized.
The "AI" generation methodology seems to give many instructions and even descends into instruction trees, manually throwing away results etc. So it requires, as usual, extreme guidance.
Very fascinating result, and it seems they wrote this blog post out of pure excitement to share their findings, and maybe to have someone throw cold water on it before publishing, ha.
Who knows if this is the actual fabled path of "self improvement", but results like this are what we expect to find on such a path.
Each time you define another task well enough for the system to work, you generalize the system just a little bit - repeat enough times and you can start to expand, develop taxonomies of functions, precisely define function spaces and metrics for improvement. This might not be a bootstrap for recursive self improvement generally, but it could definitely inform the theory or design of a system that does bootstrap rsi.
The structure of their research - the process, the specific task, and the data they generate - will help inform how other research gets performed. Instead of GPU kernels, maybe the next task is something like neuron modules, looking for structures that improve on attention blocks, or things like that - each time you run through an experiment like this, you're creating foundational data upon which other experiments can be run and improved. Once you've done enough of them, you can generalize.
It could be that the end result is the knowledge of strict boundaries of LLM capabilities, that they can only operate in specific domains, or only improve to a certain extent, and some currently unspecified defect limits the level of improvement.
The underlying idea of specifying a domain and task conditions, then letting an LLM run thousands of experiments, is a great search technique. The hope is that there is no implicit defect and that the methodology will extend and generalize - it's not too complex a notion to think that you could have an LLM create a broad range of individual tasks, with a meta-goal of identifying better and more general recursive improvement processes and algorithms.
>The hope is that there is no implicit defect and that the methodology will extend and generalize - it's not too complex a notion to think that you could have an LLM create a broad range of individual tasks, with a meta-goal of identifying better and more general recursive improvement processes and algorithms
Again, entirely different idea that doesn't have a straightforward evaluation function. As it stands, this is more akin to genetic programming with a very good mutation function.
what's going to be interesting is to see the large space of fused kernels being tackled by AI generated code. that might include gemm + relu + gemm + a norm of some kind - which would be annoyingly exhaustive to 1. sweep with a tuner and 2. handwrite as a human
A function that is meant to be executed in parallel on an attached GPU is called a kernel. In CUDA, a kernel is usually identified by the presence of the __global__ specifier in front of an otherwise normal-looking C++ function declaration.
This sounds more like using AI (an LLM) as one small step, where the randomness in the output is used to implement a genetic algorithm, than being "AI-generated" (admittedly technically correct).
Sometimes I think of LLMs as kind of a hive mind. They're trained on the thought processes of so many humans. I think that's why they're able to do these kinds of things, given how much information and context is compressed into the weights.
I was thinking this was about leaking the kernels or something, but no, they are "publishing" them in the sense of putting out the blog post - they just mean they are skipping the peer review process and not doing a formal paper.
Disclaimer: This used to be my bread and butter, but I'm really rusty after five years of not working on this sort of stuff.
That said, after quickly skimming the example AI-generated kernel I am not seeing anything novel there. While working at nVidia I did see a handful of techniques that, frankly, blew my mind.
Thus, I wonder what makes this AI-generated kernel faster than the standard pyTorch kernel, which I presume is simply delegating all the heavy lifting onto cuDNN. My guess, and it's just a guess, is that they are comparing the fastest AI-generated kernel they produced for a very particular set of parameters against whatever kernel cuDNN is picking for that same scenario, and perhaps the subsystem inside cuDNN that picks which kernel to execute out of the very large database it manages chose a suboptimal candidate. Researchers tend to completely ignore this issue and assume that cuDNN is always able to choose the very best kernel in every possible scenario, something that is just not realistic.
Maybe there is something else going on, but these sort of "we have beaten this heavily optimized proprietary library" always seem to miss this very important point.
Kind regards to any NVidia insiders who may read this. You guys are the brightest people I've ever met.
Thanks for the link. I did have a quick look at the github repo yesterday, trying to find where this "default pytorch implementation" lives, but it wasn't immediately apparent due to the layers of abstraction in the code.
There's little doubt in my mind that the authors of the blog post went after the largest performance delta between their best AI-generated code and the default pytorch implementation, without looking into real-world nuances like "picking a good kernel out of a zillion options" or "production-quality output accuracy". It's cool that this sort of thing is being researched, but the results need to be taken with a grain of salt. Maybe I'm too jaded.
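One way to probe the kernel-selection concern from the PyTorch side is a rough timing harness like the one below (my own sketch, not the blog post's methodology). Flipping cudnn.benchmark makes cuDNN time several candidate kernels for the exact shape and pick the fastest, instead of trusting its heuristic chooser, which changes what "the PyTorch baseline" even means:

```python
import torch

# With benchmark=True, cuDNN autotunes: it times candidate kernels for this
# exact input shape on the first call and caches the winner.
torch.backends.cudnn.benchmark = True

conv = torch.nn.Conv2d(128, 128, 3, padding=1).cuda().eval()
x = torch.randn(32, 128, 56, 56, device="cuda")

def time_ms(fn, iters=100):
    for _ in range(10):      # warm-up also triggers the autotuning pass
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

with torch.no_grad():
    print(f"conv3x3 baseline: {time_ms(lambda: conv(x)):.3f} ms")
```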
>and test for correctness by checking the numerical equality of the two outputs over many random inputs.
This is fundamentally different to how any human would approach this problem. And also different to how some recent advances in this area were made, where AI actually came up with superior and correct algorithms.
This approach also seems quite unfortunate and makes many of these results somewhat doubtful.
IIRC there was another paper recently, with a similar methodology, about computing xAx. Those papers produce algorithms which aren't just empirically correct, but provably correct. They do this by operating on a graph data structure which describes the algorithm, and then verifying its algebraic equality to the correct result.
There is a substantial difference here. And I think relying on algorithms which are only empirically correct can be dangerous.
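For reference, the kind of empirical check being criticized is roughly the following (a sketch with made-up shapes and tolerances, not the authors' exact harness). It only gives evidence, not proof: a kernel that is wrong on inputs the sampler never produces still "passes", which is exactly what an algebraic proof over the algorithm's graph rules out.

```python
import torch

def empirically_equal(candidate, reference, make_input, trials=100,
                      rtol=1e-3, atol=1e-3):
    # Passes if outputs agree on randomly sampled inputs.
    for _ in range(trials):
        x = make_input()
        if not torch.allclose(candidate(x), reference(x), rtol=rtol, atol=atol):
            return False
    return True

def cand(x):
    # hand-rolled softmax standing in for a generated kernel
    e = torch.exp(x - x.max(dim=-1, keepdim=True).values)
    return e / e.sum(dim=-1, keepdim=True)

ref = lambda x: torch.softmax(x, dim=-1)
print(empirically_equal(cand, ref, lambda: torch.randn(64, 128)))
```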
It does not understand that it needs clarification. That behavior is just a replicated pattern.
In analogy terms, learning from this conversation, (understanding) is to create a bunch of LEGO blocks in your head, which you can then reuse and rebuild according to the rules of LEGO.
One of the intuitions is that humans can hallucinate, because they can have a version of reality in their head which they know is accurate and predicts physical reality, but they can be sick/ill and end up translating their sensory input as indicating a reality that doesn’t exist. OR they can lie.
Hallucinations are a good transition point to move back to LLMs, because LLMs cannot actually hallucinate, or lie. They are always “perceiving” their mathematical reality, and always faithfully producing outputs.
If we are to anthropomorphize it back to our starting point about “LLMs understand”, this means that even when LLMs “hallucinate” or “lie”, they are actually being faithful and honest, because they are not representing an alternate reality. They are actually precisely returning the values based on the previous values input into the system.
“LLMs understand” is misleading, and trojans in a concept of truth (therefore untruth) and other intuitions that are invalid.
—-
However, understanding this does not necessarily change how you use the LLMs 90% of the time, it just changes how you model them in your head, resulting in a higher match between observer reality and your predictive reality.
For an LLM this makes no difference, because it's forecasting the next words the same way.
(Ironically, I would not be too surprised if it was produced by an LLM.)
* replying with “I don’t know” a lot more often
* consistent responses based on the accessible corpus
* far fewer errors (hallucinations)
* being able to beat Pokémon reliably and in a decent time frame without any assistance or prior knowledge about the game or gaming in general (Gemini 2.5 Pro had too much help)
In the first prompt the replicated pattern is to ask for clarification, in the second prompt the replicated pattern is to perform the work. The machine might understand nothing but does it matter when it responds appropriately to the different cases?
I don't really care whether it understands anything at all, I care that the machine behaves as though it did have understanding.
No. You have an initial prompt that is vague, and then you have another prompt that is more specific.
- "draw me an automobile"
- "here's a picture of an ambulance."
- "could you make it a convertible instead? Perhaps green."
- "ok, here's a picture of a jaguar e-type".
Saying "LLMs match understanding well enough" is to make the same core error as saying "rote learning is good enough" in a conversation about understanding a subject.
The issue is that they can pass the test(s), but they don't understand the work. This is the issue with a purely utilitarian measure of output.
https://en.m.wikipedia.org/wiki/Chinese_room
Argumentum ad populum, but I have the impression that most computer scientists, at least, do not find Searle's argument at all convincing. Too many people for whom GEB was a formative book.
Yes. If you give models with a 2024 cutoff the documentation for a programming language written in 2025, they are able to write code in that language.
Grammars are an attempt at describing a language. A broken attempt if you ask me. Humans also don't like them.
For formal languages, of which programming languages (and related ones like query languages, markup languages, etc.) are an instance, the grammar defines the language. It comes first, examples second.
Historically, computers were very good at formal languages. With LLMs we are entering a new age where machines are becoming terrible at something they once excelled at.
Have you lately tried asking Google whether it's 2025? The very first data keeping machines (clocks) were also pretty unreliable at that. Full circle I guess.
YES! Sometimes. You'll often hear the term "zero-shot generation", meaning creating something new given zero examples; this is something many modern models are capable of.
Neither does your average human. What's your point?
But being better than an average human is usually one of the higher bars.
Autocomplete/intellisense in an IDE is probably the most salient example. An autocomplete that performs _as well_ as the average programmer is, well, totally useless.
Of course it can. It will experiment and learn just like humans do.
Hacker News people still think LLMs are just some statistical model guessing things.
That's exactly what they are. It's the definition of what they are. If you are talking about something that is doing something else, then it's not an LLM.
The power of LLMs is the emergent properties that have appeared which weren't specifically taught to the model. That is not a property that exists in statistical, largely more deterministic models.
If you think it's just a giant token prediction machine you've just ignored the last 5 years.
this is so reductive it's almost not even worth talking about. you can prove yourself wrong within 30 minutes but you choose not to.
Humans also don't understand and are frequently faking understanding, which for many tasks is good enough. There are fundamental limits to what humans can do.
The AI of a few months ago, before OpenAI's sycophancy, was quite impressive; less so now, which suggests it is being artificially stunted so more can be charged later. It means that privately it is much better than what is public. I can't say it "understands," but I can say it outclasses many, many humans. There are already a number of tasks based around understanding where I would choose an LLM over a human.
It's worth looking at bloom's taxonomy (https://en.wikipedia.org/wiki/Bloom%27s_taxonomy): In the 2001 revised edition of Bloom's taxonomy, the levels were renamed and reordered: Remember, Understand, Apply, Analyze, Evaluate, and Create. In my opinion it is at least human competitive for everything but create.
I used to be very bearish on AI, but if you haven't had a "wow" moment when using one, then I don't think you've tried to explore what it can do or tested its limits with your own special expertise/domain knowledge, or if you have then I'm not sure we're using the same LLMs. Then compare that experience to normal people, not your peer groups. Compare an LLM to people into astrology, crystal healing, or homeopathy and ask which has more "understanding."
Does that actually matter? Probably not for many everyday tasks...
The claim was LLMs understand things.
The counter was, nope, they don't. They can fake it well though.
Your argument now is, well humans also often fake it. Kinda implying that it means it's ok to claim that LLMs have understanding?
They may outclass people in a bunch of things. That's great! My pocket calculator 20 years ago also did, and it's also great. Neither understands what they are doing though.
This is what you wrote:
> LLMs don't understand.
That's it. An assertion of opinion with nothing else included. I understand it sucks when people feel otherwise, but that's just kinda how this goes. And before you bring up how there were more sentences in your comment, I'd say they are squarely irrelevant, but sure, let's review those too:
> It's mind-boggling to me that large parts of the tech industry think that.
This is just a personal reporting of your own feelings. Zero argumentational value.
> Don't ascribe to them what they don't have.
A call for action, combined with the same assertion of opinion as before, just rehashed. Again, zero argumentational value.
> They are fantastic at faking understanding.
Opinion, loaded with the previous assertion of opinion. No value add.
> Don't get me wrong, for many tasks, that's good enough.
More opinion. Still no arguments or verifiable facts presented or referenced. Also a call for action.
> But there is a fundamental limit to what all this can do.
Opinion, and a vague one at that. Still nothing.
> Don't get fooled into believing there isn't.
Call for action + assertion of opinion again. Nope, still nothing.
It's pretty much the type of comment I wish would just get magically filtered out before it ever reached me. Zero substance, maximum emotion, and plenty of opportunities for people to misread your opinions as anything more than that.
Even within your own system of opinions, you provide zero additional clarification why you think what you think. There's literally nothing to counter, as strictly speaking you never actually ended up claiming anything. You just asserted your opinion, in its lonesome.
This is no way to discuss anything, let alone something you or others likely feel strongly about. I've had more engaging, higher quality, and generally more fruitful debates with the models you say don't understand, than anyone here so far could have possibly had with you. Please reconsider.
My favorite thing about LLMs is that they can convincingly tell me why I'm wrong or how I could think about things differently, not for ideas on the order of sentences and paragraphs, but on the order of pages.
My second favorite thing is that it is amazingly good at deconstructing manipulative language and power tactics. It is scary good at developing manipulation strategies and inferring believable processes to achieve complex goals.
[0] https://youtu.be/WTs-Ipt0k-M
And if that is so, didn't you also "just" express an opinion? Would your own contribution to the discussion pass your own test?
You might have overlooked that I provided extensive arguments all around in this thread. Please reconsider.
This is not what I said, no: I said that asserting your opinion over others' and then suddenly pretending to be in a debate is "not allowed" (read: is no way to have a proper discussion).
A mere expression of opinion would have been like this:
> [I believe] LLMs don't understand.
And sure, having to stick an explicit "I think / I believe" everywhere is annoying. But it became necessary when everything else you had to say kept omitting this magic phrase, and its absence became clearly intentional when you started talking as if you had made arguments of your own. Merely expressing your opinion is not what you did, even when reading it charitably. That's my problem.
> Would your own contribution to the discussion pass your own test?
And so yes, I believe it does.
> You might have overlooked that I provided extensive arguments all around in this thread. Please reconsider.
I did consider this. It cannot be established that the person whose comment you took a whole lot of issue with also considered those though, so why would I do so? And so, I didn't, and will not either. Should I change my mind, you'll see me in those subthreads later.
But I'm afraid that most folks using the term mean it more literally than you describe.
(Besides, we know what LLMs do, and none of those things indicate understanding. Just statistics.)
You can explain this to an LLM
The LLM can then play the game following the rules
How can you say it hasn't understood the game?
An LLM can pass many tests; it is indistinguishable from someone who understands the subject.
Indistinguishable does not imply that the process followed matches what a human is doing when they understand a subject.
I use this when I think of humans learning - humans learn the most when they are playing. They try new things, explore ideas and build a mental model of what they are playing with.
To understand something, is to have a mental model of that thing in ones head.
LLMs have models of symbol frequency, and with their compute, are able to pass most tests, simply because they are able to produce chains of symbols that build on each other.
However, similar to rote learning, they are able to pass tests. Not understand. The fight is between the utilitarian point "LLMs are capable of passing most tests" and the factual point "LLMs don't actually understand anything".
This articulation of the utilitarian point is better than the lazier version which just says "LLMs understand" - that version ends up anthropomorphizing a tool and creating incorrect intuitions about how LLMs work among other citizens and users.
Claiming anything else requires a proof.
https://arxiv.org/abs/2206.07682
https://towardsdatascience.com/enhanced-large-language-model...
https://arxiv.org/abs/2308.00304
(and if MoRA is moving the goal posts, fine: RL/RT)
That statement reveals deep deficiencies in your understanding of biological neural networks. "electrical activity" is very different from "pre-programming". Synapses fire all the time, no matter if meaningfully pre-programmed or not. In fact, electrical activity decreases over time in a human brain. So, if anything, programming over time reduces electrical activity (though there is no established causal link).
> I sometimes think this debate happens because humans don't want to admit we're nothing more than LLMs programmed by nature and nurture, human seem to want to be especially special.
It's not specific to humans. But indeed, we don't fully understand how the brains of humans, apes, pigs, cats and other animals really work. We have some idea of synapses, but a lot is still unclear. It's like thinking that just because an internal combustion engine is made of atoms, and we mostly know how atomic physics and chemistry work, anybody with that basic knowledge of atomic physics can understand and even build an ICE. Good luck trying. It's similar with a brain. Yes, synapses play a role. But that doesn't mean a brain is "nothing more than an LLM".
Humans are born with innate neural architectures to be filled and developed - not literal blank slates, there is an OS. The electrical activity during development is literally the biological process that creates our "base programming." LLMs have architectural inductive biases (attention mechanisms, etc.), human brains have evolved architectural biases established through fetal development. We're both "pre-programmed" systems, just through different mechanisms.
Your response about "electrical activity decreases over time" is irrelevant - you weren't talking about adult brain activity, you were talking about the developmental process that creates our initial neural architecture.
tbh: I can't tell if you're engaging in good faith or not.
Well, I prefer it that way, but the spirit of "AI" seems to go in another direction, and the leadership of US government also does, so maybe times are just changing.
It’s incredibly easy to get LLMs to do a lot of stuff that seems convincing.
They are literally trained for plausibility.
This is epic work. Would love to see more of it but I guess you're gonna take it the startup route since you have connections. Best of luck.
For instance, you might have an SEO expert on the team, but that alone won't guarantee top search engine rankings. There are countless SEO professionals and tools (human or AI-powered), and even having the best one doesn't eliminate the underlying challenge: business competition. LLMs, like any other tool, don’t solve that fundamental problem.
This sounds like those guys on social media who one-up each other with their bedtimes and end up saying they wake up every day at 2am to meditate and work out.
And other companies have existed for hundreds of years and had thousands of people work for them and never even made $100M.
The forking is free. Running the sub-agents is linear cost, but the expensive bit is joining the agents responses back together again.
If a task has 6 subtasks and an agent is spawned for each, at some point some 'joiner' agent needs to parse and summarize the findings of the sub-agents and feed it back to the parent. That step necessarily involves information loss, and uses extra computation that a single linear agent design would not need.
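A minimal sketch of that fan-out/join shape (hypothetical helper names; `call_llm` stands in for whatever model API is actually in use):

```python
import asyncio

async def call_llm(prompt: str) -> str:
    # Placeholder for a real model call; sub-tasks can run concurrently.
    await asyncio.sleep(0.1)
    return f"result for: {prompt}"

async def run_task(task: str, subtasks: list[str]) -> str:
    # Fan out: one agent per subtask, all in parallel.
    results = await asyncio.gather(*(call_llm(s) for s in subtasks))
    # Join: a summarizer call folds the sub-results back together.
    # This is the lossy, extra-cost step a single linear agent wouldn't need.
    joined = "\n".join(results)
    return await call_llm(f"Summarize these findings for task '{task}':\n{joined}")

print(asyncio.run(run_task("optimize kernel", [f"subtask {i}" for i in range(6)])))
```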
Might it be just another realization of Conway's law?
https://en.wikipedia.org/wiki/Conway%27s_law
Might it be possible that the only reason you're assuming a system is junk is just that it doesn't resemble the systems you know and expect? There are so many ways to skin a cat, and certainly no business process represents the optimal process.
And this is precisely how really bad things could happen:
https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality...
You don't.
Sincerely, Your Electricity Bill
Most of what people use agents for daily can often be one-shotted though and even collating/rating 10 results would be costly.
If I had a harness for evaluating the results and VC level money, I'd be throwing an army at well defined experimental tasks as well.
People haven't spent time optimizing the fp32 versions of these kernels in years. This will be much more interesting if they can improve the kernels where developer effort has gone and that are actually used.
For a processor with well-documented microarchitecture, for which a programmer or a compiler can deterministically write an optimal program, it is much less likely that applying ML/AI can be successful, except as a substitute for searching already known solutions.
On the other hand, for less documented microarchitectures, like of the NVIDIA GPUs, finding an optimal program may be impossible other than by doing a random search guided by examples of previous optimized programs, and possibly doing some reverse-engineering work to determine the real behavior of the GPU in some circumstances.
Improving over something like this is likely to be feasible for ML/AI, where training over known good programs may be able to extract some of the undocumented behavior that may be non-obvious for humans reading those examples.
We don't even know the optimal algorithms! AlphaEvolve recently found "an algorithm to multiply 4x4 complex-valued matrices using 48 scalar multiplications, improving upon Strassen’s 1969 algorithm that was previously known as the best in this setting." - https://www.nature.com/articles/s41586-022-05172-4
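For context on why 48 is notable (my own arithmetic, based on the AlphaEvolve announcement rather than anything stated in the comment above): Strassen's 2x2 scheme uses 7 multiplications, so applying it recursively to a 4x4 product costs 7^2 = 49 scalar multiplications; 48 beats that by exactly one.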
You severely underestimate the landscape of possible implementations for these kernels. There are many ways of performing a matrix multiplication and predicting which one will perform best without running them all is nontrivial, even with perfect knowledge of the underlying system.
This is just a completely incorrect take, speaking as a former insider.
For a completely documented processor, one must be able to run a simulation model that provides the running time of a given program, when the processor is so complex that simpler methods of computing the execution time do not work.
For older NVIDIA GPUs, there exist such simulation models, but they are only partially accurate, because they are based on reverse engineering, without cooperation from the GPU vendor.
You can't precompute all the possible options in advance and fetch the running time from a database either, because the parameter space is just way too huge.
Notice that none of this has anything to do with having accurate models of the system. This is what people who do this for a living and have perfect knowledge of the system choose to do, for good reasons.
For register allocation and instruction selection, there is hope, because the problems are fixed-parameter tractable (FPT) and there are algorithms to solve them optimally in polynomial time, albeit with a large constant factor, making them impractical to apply in compilers as of today. For instruction scheduling, it is just too hard. If you read the literature on scheduling algorithms, it is NP-hard for apparently simple instances, e.g., 2 parallel identical machines with no preemption and bounding completion time (https://www2.informatik.uni-osnabrueck.de/knust/class/), while actual microarchitectures are much more complicated than this...
Needless to say, these are already the simpler problems. The longer the program or the more profiling data you can optimize for, the more tricks you can throw at it, and most of them are NP-hard to optimize optimally.
Being NP-hard doesn't imply that you can't obtain the optimal result, but no compiler that I know of implements such algorithms, because most users are not willing to wait days for a compilation to complete. Ideally, one would build something that can run on clusters of CPUs or GPUs to do this optimization, and people who have those clusters will typically be willing to, because they want to optimize the programs they later run on them. However, to my knowledge, no one is working on this at the moment.
The latter happens when there is one dominant bottleneck for the algorithm, which is determined by the hardware, e.g. the maximum throughput of a certain instruction, like multiplication or memory loads. When the implemented program reaches a throughput almost equal to that absolute limit, one can be certain of its optimality.
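A back-of-the-envelope version of that check, with made-up hardware numbers (treat them as placeholders, not any real device's spec):

```python
# If a kernel is bandwidth-bound, the bytes it must move divided by peak
# memory bandwidth is a hard lower bound on its runtime.
peak_bw_gbs = 900.0          # device memory bandwidth in GB/s (assumed)
n = 1 << 26                  # elements in an elementwise fp32 kernel
bytes_moved = n * 4 * 3      # read a, read b, write out

lower_bound_s = bytes_moved / (peak_bw_gbs * 1e9)
measured_s = 1.05e-3         # hypothetical measured runtime

print(f"bandwidth-bound lower bound: {lower_bound_s * 1e3:.3f} ms")
print(f"measured: {measured_s * 1e3:.3f} ms "
      f"({lower_bound_s / measured_s:.0%} of the absolute limit)")
```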
You don't know what you're talking about - no compiler engineer in their right mind would intentionally use a randomized algorithm in a compiler. It's a bug every single time and it gets squashed immediately.
1. You can dump the SASS that corresponds to PTX: `cuobjdump --dump-sass <input_file>`
2. Getting the cycle count of a single instruction for an OOO architecture is completely meaningless because you have no idea when the instruction will actually be issued. This is true for both AMDGPU and NV.
Wow, so, you're basically saying the AI created new algos in a domain with no pre-existing solutions? Awesome!
The implication was that the FP32 versions of these kernels have lagged behind the more popular versions. There was opportunity to translate the advancements from other kernels into these. Someone would need to look closely to see exactly what was done, but it’s premature to suggest anything like “new algos” or “no pre-existing solutions”
This is a great use case for LLMs, though. I often do something similar where I make improvements to something I use most frequently and ask an LLM to translate that pattern to other similar parts of the code.
Help me understand this 'cause I'm a bit slow these days ...
Does that mean optimized FP32 versions of these kernels were already there or not?
If you're trying to support your original point with that argument, then you're using some pretty awful definitions of the terms "new algos" and "no pre-existing solutions".
If I do `sed 's/f32/f16/g' kernel.cu` does this count as AI? Help me understand because I'm a little slow when it comes to all the dumb shit people attribute to LLMs these days...
>sed 's/f32/f16/g' kernel.cu
This is not what's happening here, it's a completely different thing, read TFA.
That being said, it is cool if AI is enabling lower cost adoption of better more optimized kernels with less effort.
Read the damn comment you're responding to. There have been human written kernels for both fp16 and fp32 for a long time.
Here is the corrected version of your comment:
"Wow, so, you're basically saying the AI created the same but faster algos in a well known domain with established pre-existing solutions, whose overall impact on the runtime of practical workloads is insignificant? Awesome!"
[1] https://deepmind.google/discover/blog/alphaevolve-a-gemini-p...
[2] https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...
Using it on the architectural side you can follow a similar procedure but instead of describing a bug you're describing architectural revisions you've gone through, what your experience with each was, what your objectives with a potential refactor are, where your thinking's at as far as candidate reformulations, and so on. Then finish with a question that doesn't overly constrain the model; you might retry from that conversation/context point with a few variants, e.g.: "what are your thoughts on all this?" or "can you think of better primitives to express the system through?"
I think there are two key points to doing this effectively:
1) Give it full, detailed context with nothing superfluous, and express it within the narrative of your real world situation.
2) Be careful not to "over-prescribe" what it says back to you. These models are very "genie-like": they'll often give exactly what you ask for in a rather literal sense, in incredibly dumb-seeming ways if you're not careful.
Chain of thought at least introduces some skepticism, but that's not exactly reasoning. It makes me wonder what people refer to when they say "reason".
How can it evaluate accuracy if it can't even detect contradictions reliably?
Permit my likely inaccurate illustration: You’re pretty sure 2 + 2 is 4, but there are several questions you could ask: are any of the numbers negative, are they decimals, were any numbers left out? Most of those questions are things you’ve learned to ask automatically, without thinking about it, because you know they’re important. But because the answer matters, you check your work by writing out the equation. Then, maybe you verify it with more math; 4 ÷ 2 = 2. Now you’re more confident the answer is right.
An LLM doesn’t understand math per se. If you type “2 + 2 =”, the model isn’t doing math… it’s predicting that “4” is the next most likely token based on patterns in its training data.
“Thinking” in an LLM is like the model shifting mode and it starts generating a list of question-and-answer pairs. These are again the next most likely tokens based on the whole context so far. “Reasoning” is above that: a controlling pattern that steers those question-and-answer sequences, injecting logic to help guide the model toward a hopefully more correct next token.
> The result is a test-time loop that looks less like “chat with a compiler” in the case of sequential revision, and more like structured exploratory search, guided by explicit optimization hypotheses and aggressively parallel evaluation.
My conclusion would be that we've now learned to apply LLMs' capabilities to shrink the solution space where we have a clear evaluation function, as well as existing solutions to problems that follow similar patterns. This applies in this case as well.
IMO, It’s not about model X gaining on other models or model Y being able to reason about the solutions, etc. in a way that other models couldn’t.
When 3.0 comes out, that... that's going to start getting a little scary.
The problems I have to solve tend to be the horrible ones that nobody has answers to, anywhere on the Internet, so unsurprisingly the AIs aren't good at it either.
The trick has been to use the AIs for what they are good that, which used to be "nothing" for me at least, but now I can use them productively for certain "spot" tasks.
Random examples:
- Cross-language and cross-platform benchmarking of a bunch of different database clients to see how they stack up. I gave the AI a working example in one language and got it to whip up a series of equivalents with other DB drivers and languages. Sure, it's trivial, but it's way faster than doing it myself!
- Crash dump analysis using WinDbg. I read somewhere that "vibe debugging" of kernel dumps totally works, so when I had an actual crash I gave it a go for laughs. With AI help I managed to extract the name of the specific file that had NTFS corruption and was crashing the server. Deleted the file, restored it from backups, and the server was good to go again!
- If you ever watch the top mechanical engineers on YouTube, they all make their own tools instead of just buying them. Jigs, extenders, unusual sizes, etc... IT work is the same. As a recent example, I got Gemini to make me a code-AST rewriter for a specific issue I wanted to clean up in bulk across a huge code base. Using the Roslyn compiler SDK is a bit fiddly, but it spat out a working tool for me in under an hour. (This is not something you can solve with a script full of regex, it needed a proper parser to handle commented-out blocks and the like.)
That's the clincher for me. So much software work is just executing on a design, not inventing anything new. Being able to do 5x the trivial work in an hour is life-changing, and it lets me pull my head out of that work to see how I can make larger process improvements. AI doesn't need to rewrite the Linux kernel in Rust to be extremely valuable to the average developer.
I've been pair programming with the models for a while, and wrote some "agents" before I knew to call it that back in the dark days of GPT-3.5, but only recently with the latest models unlocking capabilities beyond what I could achieve with handwritten code.
That's a huge tolerance, and it allows them to use fp16 operations to replace the "fp32" kernel.
Maybe this intuition is wrong, but it would be great for the work to address it explicitly if so!
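To illustrate the worry (a toy sketch, not the blog's actual kernels or harness): with rtol/atol around 1e-2, an "fp32" op that secretly does its arithmetic in fp16 still passes the check.

```python
import torch

torch.manual_seed(0)
a = torch.randn(4096)
b = torch.randn(4096)

ref = a * b + a                                   # genuine fp32 arithmetic
out = (a.half() * b.half() + a.half()).float()    # fp16 impostor

# fp16 rounding error here is on the order of 1e-3, comfortably inside
# a 1e-2 tolerance, so the impostor is accepted as an "fp32" kernel.
print(torch.allclose(out, ref, rtol=1e-2, atol=1e-2))   # True
print((out - ref).abs().max().item())                   # a few 1e-3 at most
```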
Replacing float32 operations with float16 is also pointless. There is nothing to be gained by doing this, as it removes the actual accuracy advantage of float32, which would be the single most important reason to use that version of the algorithm.
I think this error is large enough that referring to it as FP32 is misleading.
Also, the performance gains do not translate to my RTX 3060M GPU (3.8 GFLOPS vs PyTorch's 5.3), presumably because it lacks the optimized hardware for half precision.
But on the plus side, the single file was very easy to adapt and the code is quite readable. I have seen much uglier kernels.
EDIT: looks like they've since generated another one that is numerically stable! great work
Seems doubtful as this works only on an extremely well-defined evaluation function.
It's just like image generation: the first iteration is the worst it will ever be.
https://cvw.cac.cornell.edu/gpu-architecture/gpu-characteris...
All of this stuff is way outside my wheelhouse, but maybe "the standard pyTorch kernel" is just a low bar? (https://news.ycombinator.com/item?id=44144346)
At the very least they could have used consumer hardware. I don't even know how to parse that model, it's so consumer-alien.
If so, why is it surprising that generic implementations in PyTorch are worse?