As much as I like Claude Code, Boris has done a lot of harm by encouraging software engineering practices that lead to slopware. We have two camps of people at work. The first camp is the "agent goes brrr" people. They don't understand the code they write. They have loops running, agent orchestrators, or whatever the agent hype du jour is. The second camp is people who are inundated with PRs, are holding the line on quality, and are just exhausted. We've also had some management pressure, where they think people are wasting time looking at code. Perhaps that's because somebody on some podcast they listen to says coding is largely solved.
> I don’t prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops.
This is going to be a net negative on software quality for people who take this up, in my opinion.
I call out Boris but I also don't think he's being malicious. He's at the center of an important technological revolution and it would be hard not to get excited. I just wish he advocated for a more balanced and realistic perspective.
Yes, I am exhausted. Most of my company is obsessed with agents, because everyone wants to be seen as AI first. There is little thought going into usage. No care for long term maintainability and quality. Our product is actively worse by many metrics, but no one cares because marketing can say “agents”.
The sad part is that this technology is incredible. It’s us choosing to turn it into a slop cannon (and the labs sure seem to encourage this).
Quoting the creator of CC holds little value in my opinion. I too call my product good.
> opting out of this fully machine-driven future may not be an option.
I am contemplating whether I want to stay inside this rat race.
I completely agree with the conclusion of this blog post, by the way. I feel uneasy, and I do not enjoy the work I deliver using LLMs. I think OP did a really good job on capturing at least my current state.
I and my friends go back and forth, every day, on whether coding with LLMs is a net plus or a net negative.
I'm at the point where I think it's dumb to not do it but also dumb to do it. I have no real answer.
I have settled on using LLMs for everything but to spend more time honing the quality and cleanliness with LLM passes afterwards than I generally would have taken to write it well myself in the first place. This is in some ways the worst of both worlds, but it somehow lets me bypass akrasia while still getting pretty good code out, so I consider it superior to how I worked before. I get more done in three months even if I get less done in a day.
> I feel uneasy, and I do not enjoy the work I deliver using LLMs.
I have basically stopped writing code in my spare time since the advent of AI. Before I felt like I was working on a classic car. Was it a practical use of my time? No. I could go out and download software that did what I wanted. Did I have fun doing it? Yes, the act of working on it was important, I felt I was still learning and improving as I did.
Nowadays I see people doing far more in a month than I could in a year and I feel like it's all a waste, like I just spent the past few years transcribing a phonebook while standing next to a photocopier.
I don't know if that'll ever change. I can't even pretend I was doing something prestigious and artisan like watchmaking because I wasn't a good programmer beforehand.
I used to think I'll be into coding for the long haul, contributing to open source, and working on multi-year side projects.
Nearly all of that passion vanished this year, and I've been struggling to replace it. I know I'm much better than the machine now, but the lines are starting to blur, and some of the small puzzles of day-to-day have been completely automated away.
We've birthed a lot of puzzle solvers that enjoyed programming, and I'm sure many of them will move on to something else that scratches the same itch. I'm keen on learning what that will turn out to be.
I may have misunderstood you, but if you mean writing code by hand: I still do. I use AI to learn by example, but I write the real code myself, and AI can help me improve it. I have learned a lot.
Before I would just throw prompts at the LLM and it'd end up building a pile of crap (but semi-working crap, and 100x faster than I ever could) - it was pretty depressing. Using tools like `grill-me` (or `grill-with-docs`) I feel like I'm actually building my understanding of the system and helping shape it, and the results are much better.
The fun part about that `grill-me` command is that when the questions are over, I've found that I can go right into implementation without needing to dump a PRD or some sort of broken up plan. Now this is obviously completely predicated on what you are asking it to grill you on. But for tasks that are semi complicated, it's fantastic.
I had an initial high where I was knocking out side projects left and right, but I eventually got to the point where I wasn't understanding my own codebase or product. Then the satisfaction left. I started getting dopamine crashes when I hit limits. Like a true addict, I started thinking "maybe I should upgrade to max"... "maybe I should just enable usage credits this one time..."
And the "Good jobs" or "awesome work" felt like stolen valor, rather than validation of a job well done.
I started to ask the AI to walk me through the code base piece by piece. I used a less powerful model.
What clicked for me, and I am here assuming you and I are on the same wavelength, is to treat the model like a junior pair programmer. The thoughts are mine, the ideas are mine, the coding style is mine. You have to be way more specific and precise with the instructions and prompts, and scaling down the model is a must -- absolutely no frontier models, Sonnet is even too much at times.
I'm the opposite, couldn't be bothered to work on code outside of work. Barely did at work because I was more focused on wrangling a small army of shitty contractors (thanks strategic partner initiative for firing all of our small shop contractors and replacing them with morons from "offshore").
Now with LLMs I find myself doing small projects that interest me or have some utility for me outside of work, and doing a lot more development in the codebases at work outside of just review/docs/arch than I was before. Also making small tools that I find pleasant/useful but were not important enough to spend time on before.
Agreed - there was always a set of things I wanted to do that I knew the magic core for, but wanted a team of implementers for the cruft, the 100k of actual testing harnesses, hyperparameter exploration, etc. I now have that team of implementers. All the problems seem research-y though - optimal binary transport systems that are zero-copy and compatible with languages, fast physical simulation optimizers, etc. So, things that all had a _LOT_ of busywork around the magic core.
In my own ham-fisted experiments with coding loops, one pathology I have noticed is that the LOC just spirals out of control. That's likely because of the layers of defensive fixes, etc., that get built. That inevitably causes context bloat (or at least navigational friction) and results in quality decline.
I wonder how many loop-related issues could be addressed by simply fixing a LOC budget, or assigning a cost in some way. Unclear how you would dial in the right numbers, though.
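To make the LOC budget idea concrete, here is a minimal sketch of what such a guard could look like. Everything in it is an assumption for illustration: the `LocBudget` name, the net-lines accounting, and the limit of 200 are made up, and as noted above it is unclear how you would dial in real numbers.

```python
from dataclasses import dataclass


@dataclass
class LocBudget:
    """Hypothetical guard that caps how much a loop may grow the codebase."""
    max_net_added: int  # net lines the whole run is allowed to add
    net_added: int = 0  # running total across accepted iterations

    def accept(self, added: int, removed: int) -> bool:
        """Record the change and return True if it fits the budget."""
        net = added - removed
        if self.net_added + net > self.max_net_added:
            return False  # rejected: the agent has to simplify before adding more
        self.net_added += net
        return True


budget = LocBudget(max_net_added=200)
print(budget.accept(added=150, removed=20))  # fits: net +130
print(budget.accept(added=120, removed=10))  # rejected: would reach +240
print(budget.accept(added=40, removed=60))   # fits: net -20 frees up budget
```

Counting net lines rather than added lines means deleting code buys the loop room to add more, which pushes against the layers-of-defensive-fixes pattern.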
Loops work when you spend the proper amount of time to understand what you want ahead of time. The prerequisite is clarity — enough clarity that you could write a careful specification that you could hand off to a junior colleague.
Often, it takes 5-6 broken crappy versions of a thing until you understand that. There is no accelerating the 5-6 broken crappy versions - there’s no agent tech that’s going to help your meat brain avoid thinking time.
So most of my time is iterating between these two phases: I don’t understand what I want, I need to read and write and play with code, okay it’s been long enough I think I know what I want (it is extremely easy to deceive yourself) … okay now I do actually know what I want and I can write a loop.
Many people think they can jump ahead with agents. You cannot fake understanding or clarity. It is painfully obvious when someone skipped that meat brain understanding phase.
I had codex write a tool to extract all my pi sessions. (Had to filter out my prompts from the agents talking to subagents).
Then I had it analyze the patterns I was making and turned that into the flowchart for the outer guidance-creating-prompt.
I didn't have to spend too much time thinking about what I wanted. I wanted it to do that.
The result is still mixed, and I'm not trusting it with delicate code bases, but for a game I've been building I dropped my check-in time to 1/5th of what I was previously spending on it.
That's not a good thing per se. I'm sure I'm missing good ideas by _not_ spending time with it. But previously I really had stagnated, with my prompts becoming mechanical #now-do-this and #now-review-that with 90% of its suggestions being correct.
Just need to (automatically) remind it to "do the hard stuff first, clean up & refactor as you go" as well as a "reflect on your work" after its first return to get it to spill the beans on any crap left behind, and then process that in the guidance-creating-prompt to dish out new work.
>Yet even with a lot of manual steering, that type of code does not come out of LLMs naturally, and even if the code comes out naturally like that, they will still attempt to handle now impossible errors.
This is something I’ve struggled to fight against in many PR reviews. Especially once already written, convincing someone that their excessive null checking is harmful is an uphill battle. Short of better modeling (and languages that allow for sum types to enable it), I haven’t been able to come up with a universally convincing argument against this kind of “shotgun parsing.”
Maybe it really just isn’t that big of a deal? But when actually reading through and refactoring a codebase I’ve always found it frustrating to manage these unnecessary checks. Sometimes they’re nearly impossible to delete safely once present without first adding some kind of logging or broad investigation.
And AI code reviews encourage delusional defensive paranoia. Triple null checking deep inside a function is technically guarding against a real risk, but in practice it should never be hit, because you've already checked for nulls in every function that calls or could call the function in question. It is thus not necessarily worth guarding against.
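As a rough illustration of the alternative to "shotgun parsing", here is a small sketch that validates once at the edge and lets inner functions rely on the type. The `User` type, `parse_user`, and `greeting` are hypothetical names, not taken from any codebase discussed here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    """A validated user: both fields are guaranteed to be present."""
    name: str
    email: str


def parse_user(raw: dict) -> Optional[User]:
    """The single place where missing or malformed data is handled (the edge)."""
    name = raw.get("name")
    email = raw.get("email")
    if not isinstance(name, str) or not isinstance(email, str):
        return None
    return User(name=name, email=email)


def greeting(user: User) -> str:
    # No null checks here: the type already says the fields exist.
    return f"Hello {user.name} <{user.email}>"
```

The argument for reviewers is then about where the check lives, not whether it exists: once `parse_user` is the only way to obtain a `User`, any extra check inside `greeting` can be deleted without an investigation.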
> the right fix is not "handle every malformed case." ... [LLMs] will still attempt to handle now impossible errors.
This is the number one code smell from LLMs and I don't know why they are so obsessed with it. In python, it often comes as `hasattr` checks on types that are defined to have that attribute, in a code base that is fully type-checked.
Why do they do that? Is it from pre-training or reinforcement learning? If the latter, can the labs please fix this?
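For anyone who hasn't run into it, the pattern looks roughly like this. The `Order` class and both functions are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Order:
    total_cents: int


def total_defensive(order: Order) -> int:
    # The smell: `Order` always defines `total_cents`, so the fallback branch
    # is dead code that every reader still has to reason about.
    if hasattr(order, "total_cents") and order.total_cents is not None:
        return order.total_cents
    return 0


def total_direct(order: Order) -> int:
    # Trust the declared type; a type checker flags callers that pass anything else.
    return order.total_cents
```

Both functions return the same value for every valid `Order`. The defensive version also silently turns a programming error into a total of 0, which is arguably worse than an exception.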
I keep thinking about at which point I should not force myself into the loop. As a developer I really like working on the code structure, making it clearer, thinking about good abstraction, breaking into modules, etc. I really take pleasure in it. At the same time I understand that at some point I am becoming the limiting factor.
If the point of the software is to benefit people, should I still care about how the code looks?
Right now, I still think that the answer is yes, but in 3 years? in 10 years?
> My current status is that I have not had much success with this way of working for code I deeply care about
If something is judgement heavy, "code I care deeply about", then I don't really agree with the direction of travel here. Don't try to delegate decisions you care deeply about.
I do like the framing of agent loop vs harness loop, but only delegate stuff that you can accurately specify in advance, that usually means stuff that's repeatable in my case ("hey go see how i did X, do that but for Y"), and that inherently means stuff that's predictable.
For stuff where lack of my judgement as input is just going to cause me to say "no", we're down to collaborating in the "agent loop" as Armin puts it. And that's totally fine. It's fast, but also safe.
Remember before AI coding assistants, sometimes you'd get an engineer join your team who was SUPER productive, and your peers would be jealous: "oh yeah but you guys only got all that done because you have X on your team!" They didn't live the curse of having that kind of person around - if you don't have them PERFECTLY aligned, then they run off at breakneck speed in the wrong direction.
I'm a software developer from way back, using tools and languages that coding agents are far less familiar with.
So when I use an agent to write code, it's in languages I'm less familiar with, and often using libraries I know nothing about.
All to say, my part of the process often ends up being:
1. "Here's what I'm looking for, in detail"
2. "That's not right. Here's one way it's not right, and a specific example. Please fix that."
3. Sometimes I give suggestions for how what is going wrong might be happening, or conceptually how to work around the issue.
4. And iterate on 2-3 until the result is close enough.
The issue is that whilst the loops will initially lead to good results, the results get worse and worse as the context gets bigger and tougher to understand for both human and AI.
We used a “loop” before it was called that to drive MS-DOC support into Tritium. Based on that experience, I take issue with this:
“There are already impressive examples of large automatic porting efforts, including the reported work around moving parts of Bun from Zig to Rust.” (Emphasis added.)
It will be impressive if/when the Bun team is able to pick up and continue extending and supporting Bun. For us, MS-DOC remains read-only and probably perpetually buggy until we reimplement with a better understanding. Until then, it’s definitely not “impressive”. Functional? Maybe. Impressive, no.
The post suggests fear about a surge of increasing amounts of code by loops and loops of agents.
I don’t know if I like the current world without it though.
In 80% of teams' code, the code is poorly tested. The code doesn’t handle data consistency or asynchronous code properly because the engineers don’t know better (and frankly don’t care enough).
Dependency handling is poorly managed, leading to low quality operations with improper dashboards, alarms, and ops.
Badly managed processes lead to people doing monkey work signing off checklists rather than automating.
Frankly… why is keeping any of that good? It really pisses me off seeing people accept any of that low quality, but that standard is the default and not the outlier.
We've had great success with agents thus far at my job. A year into Clauding and all our dev metrics are up while our downtime has remained steady.
Being an iOS engineer, much of my engineering cycle these days is going from Figma/PRD → spec → code. After being handed off to QA, we handle the bugs and product slips as they come through, while we simultaneously build/spec the upcoming addition. This is basically the same agile style that's been popular for 20y, just super-powered with agents.
How might someone accomplish the same goals using loops instead?
I personally have not had good luck with loops due to similar issues as the post author - but if you were to port your flow to "looping" it would be something like:
- An automation that periodically checks for PRDs at a given location that have not yet been implemented.
- If it sees one not implemented, it puts a lock on it (so other agents don't pick it up later while it's still working) and implements the PRD in code, assuming it has the Figma link and all specs required.
- When it's done it makes a PR, waits to see if it passes, and in some cases even automatically merges into your staging/preview environments and just pings you with a build/URL. You can then leave feedback, and it can also poll for pending feedback. Or you just mark it as looking good; the agent then merges the PR, moves the PRD to implemented status, maybe even writes/updates docs and cleans up any temporary work.
- Repeat checking for new PRDs every T units of time (10 minutes, 1 hour, etc.).
This is how people say you should be looping - you never even cared or looked at the code, and also never prompted the agent yourself.
But I find most agents are often still pretty bad at replicating UI compared to making something from scratch, and most design specs are still not detailed about how things look at all sizes, in all scenarios, etc. Design seems to be one of those things that still requires a human to validate. And then there are all the things the post author mentions: agents not being willing to apply hard constraints, minimize impossible states, validate at the edges, or avoid horrendous overchecking.
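The polling flow in the bullets above can be sketched as a small state machine. This is only an in-memory illustration under stated assumptions: the `Prd` type, the `Status` values, and the `implement` and `checks_pass` callables are hypothetical stand-ins for the agent run and CI, and a real lock would have to be atomic across processes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):
    PENDING = "pending"
    LOCKED = "locked"
    IN_REVIEW = "in_review"
    IMPLEMENTED = "implemented"


@dataclass
class Prd:
    title: str
    status: Status = Status.PENDING


def poll_once(prds: list[Prd],
              implement: Callable[[Prd], str],
              checks_pass: Callable[[str], bool]) -> list[str]:
    """One polling pass: claim each pending PRD, implement it, open a PR."""
    opened = []
    for prd in prds:
        if prd.status is not Status.PENDING:
            continue  # locked, in review, or already done
        prd.status = Status.LOCKED  # so other agents don't pick it up
        pr = implement(prd)  # stand-in for the agent run
        if checks_pass(pr):
            prd.status = Status.IN_REVIEW  # the human gets pinged with a build/URL
            opened.append(pr)
        else:
            prd.status = Status.PENDING  # release the lock, retry on the next pass
    return opened


def approve(prd: Prd) -> None:
    """Human marks it as looking good: the PRD moves to implemented."""
    if prd.status is Status.IN_REVIEW:
        prd.status = Status.IMPLEMENTED
```

In a real setup `poll_once` would be called every T units of time by a scheduler. Note that nothing in this sketch looks at the code itself, which is exactly the property being questioned in this thread.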
Use Appium or XCTest or Swift Testing; generate the tests first (failing) from the spec.
The loop is basically then a while loop:
While (tests fail) { trigger agent: spec, failures list }
For bugs, write failing tests.
It's basically TDD.
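Spelled out a little more, the while loop above might look like the sketch below. The `run_tests` and `trigger_agent` callables are hypothetical placeholders for whatever test runner and agent invocation you use, and the `max_iterations` cap is my own addition so that a stuck agent cannot spin forever.

```python
from typing import Callable


def run_until_green(spec: str,
                    run_tests: Callable[[], list[str]],
                    trigger_agent: Callable[[str, list[str]], None],
                    max_iterations: int = 10) -> bool:
    """Re-run the agent with the spec and the failure list until tests pass."""
    for _ in range(max_iterations):
        failures = run_tests()
        if not failures:
            return True  # all tests pass: the loop is done
        trigger_agent(spec, failures)
    return not run_tests()  # final check after the last agent run


# Toy usage: a fake agent that "fixes" one failing test per call.
remaining = ["test_login", "test_logout"]
ok = run_until_green(
    "spec text",
    run_tests=lambda: list(remaining),
    trigger_agent=lambda spec, failures: remaining.pop(),
)
print(ok)
```

The harness only knows about test names, so the quality of the result is bounded by the quality of the tests generated from the spec.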
Loops do nothing useful beyond making the “spec -> code” step more “hands off” and let you be confident that the code you write does what is intended.
Obviously you see the issue: writing the loop harness is > effort than not having it…
…but the idea is that you run “spec first” and are totally hands off on the code, just updating the validation step and then waiting while the agent iterates over and over to solve for some solution that passes the loop harness.
People suggest that it is possible to go, e.g., directly from Figma/Jira to harness via (random tool here), saving even more time and involving even fewer humans, but that's currently, as far as I can tell, actually just hype.
No one is actually doing that effectively.
Loops are currently carefully hand crafted, which makes them tedious and of questionable value imo.
Would you have a breakdown of costs/benefit? Can you say with certainty that this workflow has increased productivity so much that you are seeing profit increases that you wouldn't have otherwise noticed just by hiring more people?
Asking with no ill intention, I just crave for actual business cases that make sense, and yet no-one seems to be able to reliably produce that.
There's _way_ more than one way to do "loops". I just asked one of my supervisors/auditors to document how it's been working while monitoring a few other agents that have long-term goals:
A friendly reminder to just do 9 to 5 and touch lots of grass. None of this shit represents industry trends, majority of people still use chat interfaces and copy blocks of code. There’s zero early adopter advantage here, only FOMO and lots of anxiety.
This is a very fatalistic take. While I understand where it's coming from, I try not to share the same mindset: engineers getting increasingly distant from how things are getting built is not something that will "undoubtedly happen, whether we like it or not".
Also:
> Now there is obviously a question if this desire to understand the code is one that I will still have a few years from now.
I do not think we should be having doubts like this. Either you consider understanding the code you ship and allowing your future self to be able to work on the system you're building to be a value, or you don't. I, for one, do, and I do not think using LLMs and coding agents will affect my point of view on that.
I think this is a common sentiment among heavy users of AI who also still care about code quality.
I've built up a skill harness and review flow that makes Opus generate slop-free code 90% of the time. But the remaining 10% requires me to stay at the helm. Especially in the early stages.
I would love to use loops to automate more, but I couldn't do it with the current generation models.
And on the back of my mind I'm still evaluating the possible future where we are forced to API pricing. I'm currently paying $400 for Opus, and use around 1.5-2 billion tokens per day. This will cost around $20k/m with API pricing. And I don't want to even imagine the possible scenario of getting locked out of frontier models because of politics.
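As a quick sanity check on those figures, here is the arithmetic they imply. The only inputs are the numbers quoted above; the 30-day month and reading the $400 as a monthly subscription are my assumptions.

```python
DAYS_PER_MONTH = 30               # assumption: billing month of 30 days
SUBSCRIPTION_PER_MONTH = 400      # USD, assuming the $400 is per month
API_COST_PER_MONTH = 20_000       # USD, the estimate quoted above


def implied_price_per_million(tokens_per_day: float,
                              monthly_cost: float,
                              days: int = DAYS_PER_MONTH) -> float:
    """Blended USD price per million tokens implied by a monthly bill."""
    tokens_per_month = tokens_per_day * days
    return monthly_cost / (tokens_per_month / 1_000_000)


low = implied_price_per_million(2.0e9, API_COST_PER_MONTH)   # 2 billion tokens/day
high = implied_price_per_million(1.5e9, API_COST_PER_MONTH)  # 1.5 billion tokens/day
print(f"implied blended price: ${low:.2f} to ${high:.2f} per million tokens")
print(f"API estimate vs subscription: {API_COST_PER_MONTH / SUBSCRIPTION_PER_MONTH:.0f}x")
```

Under these assumptions the $20k estimate corresponds to well under a dollar per million tokens on average, and to 50 times the subscription price.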
Will the models get better to cut me out of the loop completely? I believe so.
Will the open source models catch up to SOTA models, and diversify from China-only? I hope so. Otherwise 2 superpowers will wield a soft power that can cripple the tech industries of all other countries.
I honestly wonder if this kind of stuff really brings something to the table. I have used Opus for some time, and I can certainly put it to good use and optimize some parts of my day-to-day job (programmer). But it fails so hard at such simple tasks that it seems to me that putting it in a loop can't just magically make everything better, unassisted. Does anyone actually use agents and loops to create new software, new technology? Has anyone created, with those systems, software they couldn't have produced otherwise, technologically speaking? Or is it at best just an accelerator, cutting down on the building time?
Show me the billion dollar solopreneur startup, or the profit increase for companies and at that point I’ll start thinking that this tasteless high level wanking might make sense in some way
One car went Mach 1, ever, apparently. Anyway, I don’t think the analogy fits. Ford or whoever didn’t loudly and frequently predict Mach 1 cars, right?
The situation is more like: Altman & co are predicting their new car will replace all vehicles: horses, trains, planes, motorcycles, there’s a real possibility the concept of vehicles will not exist other than cars, in the future. Meanwhile it hasn’t really done highway speeds yet. It does some impressive runs on curated tracks, and people use it around their farms (it seems to work ok for some of them).
It is a terrible analogy that shows terrible thinking.
After all, there's one thing we can bet on with more confidence: delegating thinking to these mediocrity machines is affecting the ability to think in scores and scores of previously smart people.
Looping is (currently) a side effect of token subsidies.
If token costs are nil, then you can afford to run verification and generation through the same models. If token costs are high, then you will go broke verifying code sprawl.
Currently costs are (mostly) absent from the conversation, even though costs are what decide the limits which shape experience.
Also: Firms can be held liable for the products they sell, so if code cannot be reviewed then that code is essentially a lawsuit waiting to happen. I believe this is what customers will be demanding in the future: someone to hold accountable when things go wrong.
> For now I have not moved past the point of comprehension being important to me.
Ah ! This is me too... at least for what I have to ship at work. Not so much for my toy/weekend projects. But it turns out agents are also good at explaining.
I think it's insane to suggest that software developers should ever get to the point where they don't even comprehend their code.
Before someone else says it, no I don't read the assembly code that is produced by my compilers. However, I can generally predict what kind of assembly will be produced, and the result is deterministic unlike LLMs. It seems like most vibe coders scoff at the idea of even looking at the code, and it just seems untenable to me when we're working with (usually correct) stochastic parrots.
This commit seems to mostly be grammar fixes? If someone used a spell/grammar checker it might produce a diff similar to this. Why does the fact it was Claude and not Microsoft Word or other matter in this case?
Yeah I don't know. Don't get me wrong, the article's points make sense. But sometimes I think that we're going to stay near this current point of productivity for a little while.
Currently my org of 8 people use around 1000 euro worth of tokens per month. We've recently had a discussion near the water-cooler, that if the cost climbs 5x-10x it may be just more worth it to get more developers (we're EU based). While the tools work and are definitely nice, even in our little org with our little budget, using Opus 4.8 we've noticed code quality going down.
If I had to bet money, I'd bet that the models will get 30-50% more nice, around 2x more expensive and we will settle into some mode where we'll use llms for some tasks, manually doing others and calling places focusing on speed at any cost some funny name like "gulags, 996, sweatshops, etc" and collectively try to somewhat avoid those places, which will need to offer a premium to attract talent. Wishful thinking.
I want to leave the industry as soon as I can.
I'm not enthusiastic about the field anymore, which sucks, because I used to love working in programming.
Here is the similar perspective: https://isene.org/2026/05/Audience-of-One-Numbers.html
This might bring back some of the satisfaction.
Now with LLMs I find myself doing small projects that interest me or have some utility for me outside of work, and doing a lot more development in the codebases at work outside of just review/docs/arch than I was before. Also making small tools that I find pleasant/useful but were not important enough to spend time on before.
I wonder how many loop-related issues could be addressed by simply fixing a LOC budget, or assigning a cost in some way. Unclear how you would dial in the right numbers, though.
Often, it takes 5-6 broken crappy versions of a thing until you understand that. There is no accelerating the 5-6 broken crappy versions - there’s no agent tech that’s going to help your meat brain avoid thinking time.
So most of my time is iterating between these two phases: I don’t understand what I want, I need to read and write and play with code, okay it’s been long enough I think I know what I want (it is extremely easy to deceive yourself) … okay now I do actually know what I want and I can write a loop.
Many people think they can jump ahead with agents. You cannot fake understanding or clarity. It is painfully obviously when someone skipped that meat brain understanding phase.
Then I had it analyze the patterns i was making and turned that into the flowchart for the outer guidance-creating-prompt.
I didn't have to spend too much time thinking what i wanted. I wanted it to do that.
The result is still mixed, and i'm not trusting it with delicate code bases, but for a game i've been building i dropped my check-in time to 1/5th i was previously spending on it.
Thats not a good thing per-se. I'm sure i'm missing good ideas by _not_ spending time with it. But previously I really had stagnated with my prompts becoming mechanical #now-do-this and #now-review-that with 90% of its suggestions being correct.
Just need to (automatically) remind it to "do the hard stuff first, clean up & refactor as you go" as well as a "reflect on your work" after its first return to get it to spill the beans on any crap left behind, and then process that in the guidance-creating-prompt to dish out new work.
This is something I’ve struggled to fight against in many PR reviews. Especially once already written, convincing someone that their excessive null checking is harmful is an uphill battle. Short of better modeling (and languages that allow for sum types to enable it), I haven’t been able to come up with a universally convincing argument against this kind of “shotgun parsing.”
Maybe it really just isn’t that big of a deal? But when actually reading through and refactoring a codebase I’ve always found it frustrating to manage these unnecessary checks. Sometimes they’re nearly impossible to delete safely once present without first adding some kind of logging or broad investigation.
This is the number one code smell from LLMs and I don't know why they are so obsessed with it. In Python, it often comes as `hasattr` checks on types that are defined to have that attribute, in a code base that is fully type-checked.
Why do they do that? Is it from pre-training or reinforcement learning? If the latter, can the labs please fix this?
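A minimal illustration of the `hasattr` smell, with a hypothetical `Invoice` class invented for the example (it is not from any real code base):

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    total_cents: int


def total_defensive(invoice: Invoice) -> int:
    # The smell: the declared type already guarantees the attribute,
    # so under a type checker this condition is always true.
    if hasattr(invoice, "total_cents") and invoice.total_cents is not None:
        return invoice.total_cents
    # This fallback silently turns a programming error into a 0 total.
    return 0


def total_direct(invoice: Invoice) -> int:
    # Trust the declared type; a wrong object fails loudly with AttributeError.
    return invoice.total_cents
```

The defensive version is not just noise. If something untyped ever slips through, it returns a plausible-looking 0 instead of failing at the point of the mistake.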
If the point of the software is to benefit people, should I still care about how the code looks?
Right now, I still think that the answer is yes, but in 3 years? In 10 years?
If something is judgement heavy, "code I care deeply about", then I don't really agree with the direction of travel here. Don't try to delegate decisions you care deeply about.
I do like the framing of agent loop vs harness loop, but only delegate stuff that you can accurately specify in advance. In my case that usually means stuff that's repeatable ("hey go see how I did X, do that but for Y"), and that inherently means stuff that's predictable.
For stuff where lack of my judgement as input is just going to cause me to say "no", we're down to collaborating in the "agent loop" as Armin puts it. And that's totally fine. It's fast, but also safe.
Remember how, before AI coding assistants, you'd sometimes get an engineer join your team who was SUPER productive? Your peers would be jealous: "oh yeah but you guys only got all that done because you have X on your team!" They didn't live the curse of having that kind of person around: if you don't have them PERFECTLY aligned, they run off at breakneck speed in the wrong direction.
So when I use an agent to write code, it's in languages I'm less familiar with, and often using libraries I know nothing about.
All to say, my part of the process often ends up being:
1. "Here's what I'm looking for, in detail"
2. "That's not right. Here's one way it's not right, and a specific example. Please fix that."
3. Sometimes I give suggestions for how what is going wrong might be happening, or conceptually how to work around the issue.
4. And iterate on 2-3 until the result is close enough.
That's a loop I'd love to automate.
So it depends really on the size of your project.
“There are already impressive examples of large automatic porting efforts, including the reported work around moving parts of Bun from Zig to Rust.” (Emphasis added.)
It will be impressive if/when the Bun team is able to pick up and continue extending and supporting Bun. For us, MS-DOC remains read-only and probably perpetually buggy until we reimplement with a better understanding. Until then, it’s definitely not “impressive”. Functional? Maybe. Impressive, no.
I don’t know if I like the current world without it though.
On 80% of teams, the code is poorly tested. The code doesn't handle data consistency or asynchronous code properly because the engineers don't know better (and frankly don't care enough).
Dependency handling is poorly managed, leading to low-quality operations with improper dashboards, alarms, and ops.
Badly managed processes lead to people doing monkey work, signing off checklists rather than automating.
Frankly… why is keeping any of that good? It really pisses me off seeing people accept any of that low quality but that standard is the default and not the outlier.
Being an iOS engineer, much of my engineering cycle these days is going from Figma/PRD → spec → code. After being handed off to QA, we handle the bugs and product slips as they come through, while we simultaneously build/spec the upcoming addition. This is basically the same agile style that's been popular for 20y, just super-powered with agents.
How might someone accomplish the same goals using loops instead?
- An automation that periodically checks a given location for PRDs that have not yet been implemented.
- If it sees one that is not implemented, it puts a lock on it (so other agents don't pick it up later while it's still working) and implements the PRD in code, assuming it has the Figma link and all the specs required.
- When it's done it makes a PR, waits to see if it passes, and in some cases even automatically merges into your staging/preview environments and just pings you with a build/URL. You can then leave feedback, and it can also poll for pending feedback. Or you just mark it as looking good; the agent then merges the PR, moves the PRD to implemented status, maybe even writes/updates docs, and cleans up any temporary work.
- Repeat checking for new PRDs every T units of time (10 minutes, 1 hour, etc).
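Roughly, the claim/implement/review steps above could be sketched like this. Everything here is hypothetical and in-memory: `PRD`, `Status`, `claim_next`, `run_once`, and `approve` are illustrative names, `implement` stands in for whatever actually invokes the agent, and a real setup would need an atomic lock in a shared store rather than a field on a Python object.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Status(Enum):
    PENDING = "pending"
    LOCKED = "locked"
    IN_REVIEW = "in_review"
    IMPLEMENTED = "implemented"


@dataclass
class PRD:
    title: str
    status: Status = Status.PENDING
    locked_by: Optional[str] = None


def claim_next(prds: List[PRD], agent_id: str) -> Optional[PRD]:
    """Lock the first pending PRD so other agents skip it."""
    for prd in prds:
        if prd.status is Status.PENDING:
            prd.status = Status.LOCKED
            prd.locked_by = agent_id
            return prd
    return None


def run_once(
    prds: List[PRD], agent_id: str, implement: Callable[[PRD], bool]
) -> Optional[PRD]:
    """One pass of the loop: claim, implement, then hand off or release."""
    prd = claim_next(prds, agent_id)
    if prd is None:
        return None
    if implement(prd):
        # Stand-in for "open a PR and ping the human".
        prd.status = Status.IN_REVIEW
    else:
        # Release the lock so a later pass can retry.
        prd.status = Status.PENDING
        prd.locked_by = None
    return prd


def approve(prd: PRD) -> None:
    """Human sign-off: only a PRD in review can become implemented."""
    if prd.status is not Status.IN_REVIEW:
        raise ValueError(f"{prd.title!r} is not awaiting review")
    prd.status = Status.IMPLEMENTED
    prd.locked_by = None
```

The outer "every T units of time" part is then just a scheduler calling `run_once`. Note that the only place a human appears in this sketch is `approve`.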
This is how people say you should be looping - you never even cared or looked at the code, and also never prompted the agent yourself.
But I find most agents are still pretty bad at replicating UI compared with making something from scratch, and most design specs are still not detailed about how things look at all sizes, in all scenarios, etc. Design seems to be one of those things that still requires a human to validate. And then there are all the things the post author mentions: it not being willing to apply hard constraints, minimize impossible states, validate at the edges, and avoid horrendous overchecking.
The loop is basically then a while loop:
While (tests fail) { trigger agent: spec, failures list }
For bugs, write failing tests.
It's basically TDD.
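Spelled out as runnable Python, the while loop might look like the sketch below. `harness_loop`, `run_tests`, and `trigger_agent` are hypothetical names; the two callables are placeholders for your test runner and for whatever prompts the agent. The `max_rounds` cap is my addition, since an unbounded loop against a stuck agent just burns tokens.

```python
from typing import Callable, List


def harness_loop(
    spec: str,
    run_tests: Callable[[], List[str]],
    trigger_agent: Callable[[str, List[str]], None],
    max_rounds: int = 5,
) -> bool:
    """Re-prompt the agent with the spec and failing tests until green.

    Returns True if the suite passes within max_rounds agent calls,
    False otherwise.
    """
    for _ in range(max_rounds):
        failures = run_tests()
        if not failures:
            return True
        # Hand the agent the spec plus the current list of failing tests.
        trigger_agent(spec, failures)
    # Final check after the last allowed agent call.
    return not run_tests()
```

All of the real effort hides inside `run_tests`: the loop is only as good as the validation it runs.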
Loops do nothing useful beyond making the “spec -> code” step more “hands off” and letting you be confident that the code you write does what is intended.
Obviously you see the issue: writing the loop harness is more effort than not having it…
…but the idea is that you run “spec first” and are totally hands off on the code, just updating the validation step and then waiting while the agent iterates over and over to solve for some solution that passes the loop harness.
People suggest that it is possible to go, e.g., directly from Figma/Jira to harness via (random tool here), saving even more time and involving even fewer humans, but that's currently, as far as I can tell, actually just hype.
No one is actually doing that effectively.
Loops are currently carefully hand crafted, which makes them tedious and of questionable value imo.
https://gist.github.com/rcarmo/4922b550ab48bf0b4246c77e606a5...
Also:
> Now there is obviously a question if this desire to understand the code is one that I will still have a few years from now.
I do not think we should be having doubts like this. Either you consider understanding the code you ship and allowing your future self to be able to work on the system you're building to be a value, or you don't. I, for one, do, and I do not think using LLMs and coding agents will affect my point of view on that.
If you usually skip straight to the comments, you might want to actually read this one.
I've built up a skill harness and review flow that makes Opus generate slop-free code 90% of the time. But the remaining 10% requires me to stay at the helm. Especially in the early stages.
I would love to use loops to automate more, but I couldn't do it with the current generation models.
And in the back of my mind I'm still evaluating the possible future where we are forced onto API pricing. I'm currently paying $400 for Opus, and use around 1.5-2 billion tokens per day. This would cost around $20k/m with API pricing. And I don't even want to imagine the possible scenario of getting locked out of frontier models because of politics.
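As a sanity check on those numbers, here is the blended per-token rate they imply. This is pure arithmetic on the figures quoted above, assuming a 30-day month; it says nothing about any provider's actual price list, and `implied_price_per_million` is just an illustrative helper.

```python
def implied_price_per_million(
    monthly_usd: float, tokens_per_day: float, days: int = 30
) -> float:
    """Blended USD per million tokens implied by a monthly bill and daily usage."""
    monthly_tokens_millions = tokens_per_day * days / 1_000_000
    return monthly_usd / monthly_tokens_millions


# $20k/month spread over 2.0 and 1.5 billion tokens per day respectively.
low = implied_price_per_million(20_000, 2.0e9)
high = implied_price_per_million(20_000, 1.5e9)
print(f"${low:.2f} to ${high:.2f} per million tokens")  # $0.33 to $0.44 per million tokens
print(f"{20_000 / 400:.0f}x the current $400")  # 50x the current $400
```

So the estimate only holds if the blended rate across all those tokens lands well under a dollar per million.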
Will the models get good enough to cut me out of the loop completely? I believe so. Will the open source models catch up to SOTA models, and diversify from China-only? I hope so. Otherwise 2 superpowers will wield a soft power that can cripple the tech industries of all other countries.
The situation is more like: Altman & co are predicting their new car will replace all vehicles: horses, trains, planes, motorcycles, there’s a real possibility the concept of vehicles will not exist other than cars, in the future. Meanwhile it hasn’t really done highway speeds yet. It does some impressive runs on curated tracks, and people use it around their farms (it seems to work ok for some of them).
We’ll see, I guess.
If token costs are nil, then you can afford to run verification and generation through the same models. If token costs are high, then you will go broke verifying code sprawl.
Currently costs are (mostly) absent from the conversation, even though costs are what decide the limits which shape experience.
Also: Firms can be held liable for the products they sell, so if code cannot be reviewed then that code is essentially a lawsuit waiting to happen. I believe this is what customers will be demanding in the future: someone to hold accountable when things go wrong.
Ah! This is me too... at least for what I have to ship at work. Not so much for my toy/weekend projects. But it turns out agents are also good at explaining.
Before someone else says it, no I don't read the assembly code that is produced by my compilers. However, I can generally predict what kind of assembly will be produced, and the result is deterministic unlike LLMs. It seems like most vibe coders scoff at the idea of even looking at the code, and it just seems untenable to me when we're working with (usually correct) stochastic parrots.
https://github.com/nfcampos/loop-dev/commit/e28b1fce0078e605...
I assume that GP was just saying that they would prefer to read these thoughts written by a human author (preferably you). I agree.
Edit: I just realized that Microsoft Word probably does do that now, and I hate it.
Currently my org of 8 people uses around 1000 euro worth of tokens per month. We recently had a discussion near the water-cooler that if the cost climbs 5x-10x, it may be more worthwhile to just hire more developers (we're EU based). While the tools work and are definitely nice, even in our little org with our little budget, using Opus 4.8 we've noticed code quality going down.
If I had to bet money, I'd bet that the models will get 30-50% nicer and around 2x more expensive, and we will settle into some mode where we use LLMs for some tasks and do others manually, while giving places that focus on speed at any cost some funny name like "gulags, 996, sweatshops, etc" and collectively trying to somewhat avoid them, so they will need to offer a premium to attract talent. Wishful thinking.
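For scale, here is what the 5x-10x scenario above does to the 1000 euro/month figure for 8 people. This is only arithmetic on the numbers given; `scaled_spend` is an illustrative helper, and the comparison against the cost of a developer is left out because no such figure was stated.

```python
from typing import Tuple


def scaled_spend(
    monthly_eur: float, headcount: int, multiplier: float
) -> Tuple[float, float]:
    """Return (total monthly spend, spend per person) after a price multiplier."""
    total = monthly_eur * multiplier
    return total, total / headcount


for multiplier in (1, 5, 10):
    total, per_person = scaled_spend(1000, 8, multiplier)
    print(f"{multiplier}x: {total:.0f} EUR/month total, {per_person:.0f} EUR/month per person")
# 1x: 1000 EUR/month total, 125 EUR/month per person
# 5x: 5000 EUR/month total, 625 EUR/month per person
# 10x: 10000 EUR/month total, 1250 EUR/month per person
```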