One consideration not mentioned is around developer sophistication. Steve alludes to the expansion effect of CodeGen ("there are millions and maybe billions who are jumping at the chance to code"), but doesn't consider that the vast majority of these people don't know about arrays, data structures, memory, containers, runtimes, etc, etc...
To me, that's the most important consideration here. Are you targeting professional devs who are enhancing their current workflows iteratively with these improvements? Or re-thinking from the ground up, obfuscating most of what we've learned to date?
Maybe we need to trudge through all of these weeds until software creation hits its final, elegant form where "Anyone Can Code".
Maybe the old Gusteau quote is actually fitting here:
"You must be imaginative, strong-hearted. You must try things that may not work, and you must not let anyone define your limits because of where you come from. Your only limit is your soul. What I say is true - anyone can ̶c̶o̶o̶k̶ code... but only the fearless can be great."
Well we'll never reach a state where anyone can code. I have pans, a supermarket nearby, cookbooks and a belly, still I'm never gonna be able to cook, I snooze after 30 minutes, even if I succeed once, I get bored and stop for months etc.
Simplifying to the point a grandma could make an app isn't gonna make any grandma WANT to make apps. And that's fine, there's no issue, we don't have to make more people code and those who want, will, even if all we had was assembly and a light board...
Which I think is the spirit of your quote basically.
That’s a bad comparison, cooking has been done by people for thousands of years, your problem with cooking is laziness, there is nothing mentally or physically stopping you from learning to cook.
I do agree with your second paragraph and it’s more that you DON’T want to cook versus you being unable to cook.
That's a bad comparison, coding has been done by people for thousands of man-years, your problem with coding is laziness, there is nothing mentally or physically stopping you from learning to code.
I do agree with your second paragraph and it’s more that you DON’T want to code versus you being unable to code.
This is pithy but wrong. There's a clear difference between software (no evidence that everyone can do it capably) and cooking (thousands of years of nigh everyone from every background doing so). Laundering it as "thousands of man hours" doesn't change the fact that we've had less than a century of evidence for people coding, and for most of that only a small subsection of the population has picked it up.
Windsurf + Haskell w/ CLI tools has been pretty amazing. Windsurf's agent will loop for minutes on its own to figure out the right structure of a program. You just need to tell it to:
- use the hoogle cli to search for the right types and functions
Woah, that sounds awesome! I'd love to see how you set that up and how much it can do without your intervention/approval for various actions. Might you have a video of your workflow that you could share?
What's the maximum file size for which this is useful in your experience? I have been refactoring some project solely to enable AI code editors to edit it. Some users in the discord suggest a maximum file size of 500 LOC or smaller, which seems unreasonable.
> The next big thing was Cursor. I must admit that I never personally fell in love with it, but given how many people I respect love it, I think that’s a me-problem
I've met so many engineers who have said exactly this. There is clearly some group of people obsessed with Cursor, but it's interesting to me how alien they seem to the majority of people using AI codegen right now.
i had to uninstall it because it had associated itself with every possible file extension. i couldn't open a file without cursor popping up. very horrifying for that to happen to my computer when working on important projects
On Linux, I have the opposite issue. I ended up hard symlinking cursor to VS Code because Cursor wasn't opening despite being set as the default editor.
I prefer vscode+copilot. It's much cheaper and has all the functionality I want. There's access to 3.5 sonnet, and it can edit/create up to 10 files at a time.
I'm pretty much full time on cursor from vscode. I don't trust it for big code blocks, but a control-k + reasonable command (I could have typed up myself) is saving me quite a bit of time.
This blog article is written in a very engaging way. It seems to be more or less a masterclass on how to keep someone's attention, although there is no meta-story making you wait for the big fulfillment at the end.
I think it is the short, punchy sections with plenty of visuals and the fact that you are telling a story the whole way through, which has a natural flow, each experiment you describe, leading to the next.
For the lmarena leaderboard to be really useful you need to click the "Style Control" button so that it normalizes for LLMs that generate longer answers, etc. Humans may find those answers more stylistically pleasing and upvote them, but they often end up being worse. When you do that, o1 comes out on top followed by o1-preview, then Sonnet 3.5, and in fourth place Gemini Preview 1206.
It's really interesting seeing the progression here, integrating AI-assisted coding tools into something like Val Town is a great arena for exploring different patterns for this stuff.
Worth checking out their Cerebras-powered demo too - LLMs at 2000 tokens/second make applying proposed changes absurdly interactive: https://cerebrascoder.com/
I wonder if you didn't try cursor's Composer tab, especially set to Agent?
I didn't care that much for cursor when I was just using Chat but once I switched to Composer I was very happy, and my experience is in total disagreement that it's not so good for smaller projects.
They also must have a good prompt for diff-based completions, I don't know how hard it is to extract that.
> The biggest problem with all current codegen systems is the speed of generation
I don't see this complained about nearly as much as I'd expect. Groq has been out for over a year, I'm surprised OpenAI hasn't acquired them and figured out how to 10x to 20x their speed on gpt4.
Yeah I don't agree. I'm building a product in the space, and the number one problem is correctness, not latency.
People are very happy to sit there for minutes if the correctness is high and the quality is high. It's still 100x or 1000x faster than finding 3rd party developers to work for you.
I wish the models were getting better but recently they've felt very stuck and this is it, so agent architectures will be the answer in the short term. That's what's working for us at srcbook rn.
I think the logic behind faster inference is that the LLM is unlikely to get it right the first time regardless of its intelligence simply due to the inherent ambiguity of human language. The faster it spits out a semi-functional bit of code the faster the iteration loop and the faster the user gets what they want.
Plus if you’re dealing with things like syntax errors, a really really fast llm + interpreter could report and fix the error in less than a minute with no user input.
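A rough sketch of that report-and-fix loop, in case it helps make the idea concrete. Everything here is an assumption for illustration: `ask_model` is a hypothetical callback standing in for whatever fast LLM you use, and Python's built-in `compile()` plays the role of the interpreter (it only checks syntax, it never runs the code).

```python
from typing import Callable


def fix_until_it_parses(
    code: str,
    ask_model: Callable[[str, str], str],  # hypothetical: (code, error report) -> new code
    max_rounds: int = 3,
) -> str:
    """Feed syntax errors back to the model until the code parses."""
    for attempt in range(max_rounds + 1):
        try:
            compile(code, "<generated>", "exec")  # syntax check only
            return code
        except SyntaxError as err:
            if attempt == max_rounds:
                raise  # give up and surface the last error to the user
            report = f"line {err.lineno}: {err.msg}"
            code = ask_model(code, report)
    raise AssertionError("unreachable")


# Stand-in for a real model: it only knows how to close one parenthesis.
def fake_model(code: str, report: str) -> str:
    return code.replace("print('hi'\n", "print('hi')\n")


fixed = fix_until_it_parses("print('hi'\n", fake_model)
print(fixed)
```

With a fast enough model each round is cheap, which is the whole argument: the loop length matters less than the time per iteration.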
Also building something in this space. I think it’s a mistake to compare the speed of LLMs to humans. People don’t like to sit and wait. The more context you can give the better but at some point (>30 seconds) people grow tired of waiting.
I'm interested in what stopped you from finishing diffs and diff based editing. I built an AI software engineering assistant at my last company and we got decent results with Aider's method (and prompts, and hidden conversation starter etc). I did have to have a fallback to raw output, and a way to ask it to try again. But for the most part it worked well and unlocked editing large files (and quickly).
Excellent question! We just didn't have the resources at the time on our small team to invest in getting it to be good enough to be default on. We had to move on to other more core platform features.
Though I'm really eager to get back to it. When using Windsurf last week, I was impressed by their diffs on Sonnet. Seems like they work well. I would love to view their system prompt!
I hope that when we have time to resume work on this (maybe in Feb) that we'll be able to get it done. But then again, maybe just patience (and more fast-following) is the right strategy, given how fast things are moving...
An interesting alternative to diffs appears to be straightforward find and replace.
Claude Artifacts uses that: they have a tool where the LLM can say "replace this exact text with this" to update an Artifact without having to output the whole thing again.
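For anyone who wants to picture it, here is a minimal sketch of an exact-text replace tool. This is not how Claude Artifacts actually implements it (I have no visibility into that); `replace_exact` is a made-up name, and refusing ambiguous matches is my own assumption about what a safe version would do.

```python
def replace_exact(document: str, old: str, new: str) -> str:
    """Replace `old` with `new`, but only when `old` occurs exactly once."""
    count = document.count(old)
    if count == 0:
        # The model misquoted the text: better to ask it to retry than to guess.
        raise ValueError("search text not found")
    if count > 1:
        # Ambiguous edit: the model needs to quote more surrounding context.
        raise ValueError(f"search text is ambiguous ({count} matches)")
    return document.replace(old, new, 1)


source = "def greet():\n    print('hi')\n"
print(replace_exact(source, "print('hi')", "print('hello')"))
```

The two error cases are where the real work is: the client has to decide whether to retry, fall back to whole-file output, or ask for more context.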
I think this is going to be the answer eventually.
Once one of the AI companies figures out a decent (probably treesitter-based) language to express code selections and code changes in, and then trains a good model on it, they're going to blow everyone else out of the water.
This would help with "context management" tremendously, as it would let the LLM ask for things like "all functions that are callers of this function", without having to load in entire files. Some simpler refactorings could also be performed by just writing smart queries.
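To make the "callers of this function" query concrete, here is a small sketch. It uses Python's standard `ast` module as a stand-in for the tree-sitter-based language I'm imagining, so it only handles Python source, and `find_callers` is a hypothetical name. It matches calls by bare name only (no import or scope resolution), and a call inside a nested function is also attributed to the enclosing one.

```python
import ast


def find_callers(source: str, target: str) -> list[str]:
    """Return names of functions whose bodies contain a call to `target`."""
    callers = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for inner in ast.walk(node):
            if not isinstance(inner, ast.Call):
                continue
            func = inner.func
            if isinstance(func, ast.Name):
                name = func.id       # plain call: target()
            elif isinstance(func, ast.Attribute):
                name = func.attr     # method-style call: obj.target()
            else:
                name = None
            if name == target:
                callers.add(node.name)
                break
    return sorted(callers)


code = """
def load():
    return 1

def parse():
    return load() + 1

def main():
    print(parse())
    load()
"""
print(find_callers(code, "load"))
```

The point is that the answer is a handful of function names rather than whole files, which is what would keep the context window small.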
Aider actually prompts the LLM to use search/replace blocks rather than actual diffs. And then has a bunch of regex, fuzzy search, indent fixing etc code to handle inconsistent responses.
Aider's author has a bunch of benchmarks and found this to work best with modern models.
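Roughly what the "fuzzy search" fallback looks like, as a sketch. This is not Aider's code: `apply_search_replace` and the 0.8 cutoff are assumptions of mine, and the matching is done with the standard library's `difflib.SequenceMatcher` over same-sized windows of lines.

```python
import difflib


def apply_search_replace(source: str, search: str, replace: str,
                         cutoff: float = 0.8) -> str:
    """Apply one search/replace edit, tolerating small mismatches in `search`."""
    if not search.strip():
        raise ValueError("empty search block")
    if search in source:  # happy path: the model quoted the text exactly
        return source.replace(search, replace, 1)

    src_lines = source.splitlines(keepends=True)
    n = len(search.splitlines())
    best_ratio, best_start = 0.0, -1
    # Slide a window of n lines over the file and keep the closest one.
    for start in range(len(src_lines) - n + 1):
        window = "".join(src_lines[start:start + n])
        ratio = difflib.SequenceMatcher(None, window, search).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, start
    if best_start < 0 or best_ratio < cutoff:
        raise ValueError("no sufficiently close match for search block")

    if replace and not replace.endswith("\n"):
        replace += "\n"  # keep the following line from being glued on
    return ("".join(src_lines[:best_start]) + replace
            + "".join(src_lines[best_start + n:]))


source = "def add(a, b):\n    return a+b\n"
# The model got the indentation and spacing slightly wrong in its SEARCH text.
search = "def add(a, b):\n  return a + b\n"
replace = "def add(a, b):\n    return a - b\n"
print(apply_search_replace(source, search, replace))
```

The cutoff is the tradeoff knob: too low and you patch the wrong place, too high and you are back to brittle exact matching.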
Oh that is super interesting! I wonder if they track how often it succeeds in matching and replacing, I'd love to see those numbers in aggregate.
Total anecdote, but I worked on this for a bit for a research-level-code code editor (system paper to come soon, fingers crossed!) and found that basic find-and-replace was pretty brittle. I also had to be confident the source appears only once (not always the case for my use case), and there was a tradeoff of fuzziness of match / likelihood of perfectly correct source.
But yeah, diffs are super hard because the format requires far context and accurate mathematical computation.
Ultimately, the version of this that worked the best for me was a total hack:
Prefix every line of the code with L#### -- the line number. Ask for diffs to be the original text and the complete replacement text including the line number prefix on both original and replacement. Then, to apply, fuzzy match on both line number and context.
I suspect this worked as well as it did because it transmutes the math and computation problems into pattern-matching and copying problems, which LLMs are (still) much better at these days.
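Here is a stripped-down sketch of the hack, reconstructed for illustration rather than taken from the actual system. The names (`number_lines`, `apply_edit`), the `window` of 3 lines, and the whitespace-insensitive comparison are all assumptions; the real thing did fuzzier matching than this.

```python
import re

PREFIX = re.compile(r"^L(\d{4}) ?(.*)$")


def number_lines(source: str) -> str:
    """Prefix every line with L#### so the model can copy the marker back."""
    return "\n".join(f"L{i:04d} {line}"
                     for i, line in enumerate(source.splitlines(), start=1))


def parse_numbered(block: str) -> list[tuple[int, str]]:
    """Split 'L0002 text' lines into (line number, text) pairs."""
    pairs = []
    for raw in block.splitlines():
        match = PREFIX.match(raw)
        if match is None:
            raise ValueError(f"missing line prefix: {raw!r}")
        pairs.append((int(match.group(1)), match.group(2)))
    return pairs


def apply_edit(source: str, original: str, replacement: str,
               window: int = 3) -> str:
    """Apply an edit, trusting the line number only approximately."""
    lines = source.splitlines()
    old = parse_numbered(original)
    if not old:
        raise ValueError("empty original block")
    new = [text for _, text in parse_numbered(replacement)]
    old_text = [text.strip() for _, text in old]
    claimed = old[0][0] - 1  # zero-based index the model claims
    # Try the claimed position first, then its neighbours, nearest first.
    starts = sorted(range(max(0, claimed - window),
                          min(len(lines), claimed + window + 1)),
                    key=lambda i: abs(i - claimed))
    for start in starts:
        span = [line.strip() for line in lines[start:start + len(old)]]
        if span == old_text:  # context agrees, ignoring indentation
            return "\n".join(lines[:start] + new + lines[start + len(old):])
    raise ValueError("edit does not match the source near the claimed lines")


source = "def area(r):\n    return 3.14 * r * r\n\nprint(area(2))"
print(number_lines(source))
# The model is off by one on the line number; the context check recovers.
print(apply_edit(source,
                 "L0003     return 3.14 * r * r",
                 "L0003     return math.pi * r * r"))
```

Note the model never has to count or compute offsets: it copies a marker it was shown, and the client does the arithmetic.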
I suspect any other "hook" would work just as well, say a comment with a nonce, and could also serve as block boundaries to make changes more likely to be complete?
This is actually a very powerful pattern that everybody building with LLMs should pay attention to, especially when combined with structured outputs (AKA JSON mode).
If you want an LLM to refer to a specific piece of text, give each one an ID and then work with those IDs.
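A tiny sketch of the ID pattern, assuming a JSON response. The `S1`, `S2` ID scheme, the function names, and the `relevant_ids` field are all invented for the example; the only real idea is that the model returns IDs and your code maps them back to text.

```python
import json


def label_snippets(snippets: list[str]) -> tuple[str, dict[str, str]]:
    """Give each snippet a short ID and build the text shown to the model."""
    by_id = {f"S{i}": text for i, text in enumerate(snippets, start=1)}
    prompt = "\n".join(f"[{sid}] {text}" for sid, text in by_id.items())
    return prompt, by_id


def resolve(response_json: str, by_id: dict[str, str]) -> list[str]:
    """Map the IDs in the model's JSON answer back to the original text."""
    ids = json.loads(response_json)["relevant_ids"]  # assumed response shape
    unknown = [sid for sid in ids if sid not in by_id]
    if unknown:
        # A hallucinated ID is easy to detect; a misquoted passage is not.
        raise ValueError(f"model returned unknown ids: {unknown}")
    return [by_id[sid] for sid in ids]


prompt, by_id = label_snippets(["alpha", "beta", "gamma"])
print(prompt)
print(resolve('{"relevant_ids": ["S3", "S1"]}', by_id))
```

Because the model only emits IDs, it cannot subtly alter the text it is pointing at, and validation becomes a dictionary lookup.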
What we found was that error handling on the client side was also very important. There's a bunch of that in Aider too for inspiration. Fuzzy search, indent fixing, that kind of stuff.
And also just to clarify, aider landed on search/replace blocks for gpt-4o and claude rather than actual diffs. We followed suit. And then we showed those in a diff UI client side
> The reality is that those tools are POWER TOOLS best used by engineers very well versed in the domain and in coding itself.
I'm starting to get a feeling of dread that our entire engineering organization is digging itself into a hole with lots of buggy code being written which no one seems to understand, presumably written with heavy LLM assistance. Our team seems to be failing to deliver more, and quality has seemingly worsened, despite leaning in to these tools.
Reading hacker news gives me the idea that LLMs are a miracle panacea, a true silver bullet. I think that the positive stories I hear on hacker news go through a big selection bias. It has always been the motivated people who utilized their tools to the best of their ability.
I definitely don't consider myself to be good in this regard either and struggle to use LLM tools effectively. Most of the time I would be happy with myself if I could just have a solid mental understanding of what the codebase is doing, never mind be a 10x AI enhanced developer.
> have a solid mental understanding of what the codebase is doing
I think this is what truly matters no matter how or even if you're slinging code. I think this is what makes highly effective folks and also cleanly explains why high performers in one team or org can fail to deliver in another company or position.
Sorry for the late (and now in wrong thread) reply; is https://news.ycombinator.com/item?id=42318876 still active?
If so, happy to have a chat on it. If you want, let me know how to contact you best (feel free to send me mail to d10.pw).
Val town has been a huge inspiration for the tscircuit site which is basically a typescript playground for electronic design. We extensively use codemirror-ts which was created by Val town and enables typescript autocomplete inside a codemirror editor. I didn’t know about codemirror-codeium but I’ll definitely look at integrating that as well!
It’s absolutely true that we are in a race for online editors, I feel fatigued competing for ai features instead of building core product features, but since my framework is new, it’s not known by any major LLM providers, so our users can’t get ai assistance unless we build something ourselves.
@stevekrouse huge shout out for your team’s open source work, hoping to help contribute upstream at some point!!
> For starters, we could feed back screenshots of the generated website back to the LLM. But soon you’d want to give the LLM access to a full web browser so it can itself poke around the app, like a human would, to see what features work and which ones don’t.
We've had some success[1] with the screenshot to actions - using Gemini/Molmo and ADB on phones. And human-like decisions were made by GPT 4o. It also recalibrates itself and says "oh we are still at the home screen, let's find the gmail app first"
Kinda got to the same conclusion as OP building in the same space. There is so much innovation going on currently that whatever you do today, two other people will do better tomorrow. Which is good news for us but a difficult time for builders.
Ditto. This was my conclusion after spending a bit of time building https://robocoder.app/
Coupled with the fact that devs prefer open source tools and are capable of (and often prefer) making their own tooling, it never seemed like a great market to go after. I also encountered a lot of hostility trying to share yet another AI developer tool.
(Note I am one of those developers who prefer open source tools — which should’ve been a hint…)
good news is the market is still too early, a lot of people still don't know these things exist. as long as you keep showing up you're gonna get a piece of the pie
"Is this fast-following competitive or is it collaborative? So far it’s been feeling mostly collaborative. The pie is so freaking large — there are millions and maybe billions who are jumping at the chance to code — that we’re all happy to help each other scramble to keep up with the demand."
The pie for whom? For drug dealers who give power users their LLM fix so they feel smart and can fake it?
The pie is certainly shrinking for software engineers, as evidenced by the layoffs. Cocky startup founders may be next.
it can do it already, the trick is to prompt it to approach it how a human would.
1. use a temp file as a reference for the entire refactor
2. make it plan the entire thing, tell it to use a high level and low level checklists, tell it to take notes for itself, and tell it to use the temp file as a scratchpad for taking notes and storing code blocks.
3. tell it to do small incremental changes, and take a bottom-up approach.
I find that even CoPilot can do this pretty quickly if you do one example refactor and then prompt it to repeat the example on all the files from a find and search.
Since that's already a huge speed up, I'm sure many of these agents can do the same.
I’m a val.town user and townie has been really nice in conjunction with having stuff working and hosted right away, it hits the sweet spot for speed and flexibility. Tough call to make on whether to continue pursuing it, excited to see what you do!
:)
- use the hoogle cli to search for the right types and functions
- include a comprehensive test suite
- run a build after every code change
- run tests after every successful build
GHC + a Claude-based agent is a thing to behold.
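The build-then-test gating from the list above can be sketched as a small check the agent runs after each edit. This is only an illustration of the ordering: `cabal build` and `cabal test` are an assumption (a stack project would use different commands), and `check_change` and the injectable `run` callback are hypothetical names, not anything Windsurf exposes.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], bool]


def run_command(cmd: Sequence[str]) -> bool:
    """Run a command and report whether it exited successfully."""
    return subprocess.run(list(cmd)).returncode == 0


def check_change(run: Runner = run_command,
                 build: Sequence[str] = ("cabal", "build"),   # assumed tooling
                 test: Sequence[str] = ("cabal", "test")) -> str:
    """Build after every change; run tests only if the build succeeded."""
    if not run(build):
        return "build failed"   # type errors go back to the agent first
    if not run(test):
        return "tests failed"
    return "ok"


# Demo with a fake runner so nothing is actually executed here.
calls = []


def failing_build(cmd: Sequence[str]) -> bool:
    calls.append(tuple(cmd))
    return cmd[1] != "build"


print(check_change(failing_build), calls)
```

Gating tests on a successful build means the compiler's type errors reach the agent before any slower test run does.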
"code" used to fire VS Code.
My rationale is: what else do they think they can get away with in my system?
VS Code Copilot Chat with #codebase in the prompt has an edit mode which behaves similarly to Cursor. Even more so with o1-preview selected.
Our first had a nice discussion on HN: https://blog.val.town/blog/codegen/
The other posts in the series:
- https://blog.val.town/blog/townie/
- https://blog.val.town/blog/building-a-code-writing-robot/
https://lmarena.ai/?leaderboard
Yes, I wonder if reilly3000 will swing by with a leaked system prompt from them too
https://www.reddit.com/r/LocalLLaMA/comments/1h7sjyt/windsur...
ChatGPT's new Canvas feature apparently does a more sophisticated version of that using regular expressions as opposed to simple text matching: https://twitter.com/sh_reya/status/1875227816993943823
Graphologue used a version of this too: https://hci.ucsd.edu/papers/graphologue.pdf
I have started using an AI coding assistant and I am not looking back.
This comes from an engineer who KEEPS telling the juniors on his team NOT to use GenAI.
The reality is that those tools are POWER TOOLS best used by engineers very well versed in the domain and in coding itself.
For them, it is really a huge time saving. The work is more like approving a PR from a quite competent engineer than writing the PR myself.
My tool of choice is Cline, which is great, but not perfect.
And the quality is 100% correlated to:
1. The model
2. The context window
3. How well I prompt it.
In reverse order of importance.
Even an ok model, well prompted, gives you satisfactory code.
But it is much faster.
When I use AI I need to continuously review, direct, and manage the AI.
I go through every change until I agree with it, fix nits, and regenerate any code that is not up to par using a better or more specific prompt.
Not doing this exercise is disastrous for the codebase.
It really explodes in complexity in no time.
Moreover it always tries to fix errors with more code. Not with better code.
https://slowtechred.substack.com/publish/posts/detail/154138...
1. https://github.com/BandarLabs/clickclickclick - Letting AI control/use my phone.