I'm a big fan of sqlite-utils, but I really don't like how Python (particularly 3.12+) changes how sqlite's transactions work -- the native behavior explained in the sqlite docs is much better IMO. I understand why Python had to change it (to be compatible with other databases) but I don't think it's a good model for sqlite.
Therefore, I created apsw-utils, a port of sqlite-utils to the amazingly-awesome apsw lib -- which is a really idiomatic sqlite lib for python. It's here: https://answerdotai.github.io/apswutils/
I've used it in lots of projects including in significant production stuff, and it's always worked great for me. IMO if you're serious about doing sqlite in python, at some point you'll probably want to check out apsw.
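For anyone who hasn't run into this, here's a minimal sketch of what "native" transaction handling looks like, using only the stdlib `sqlite3` module (not apsw or apswutils, whose APIs aren't shown here). It assumes `isolation_level=None`, which in the legacy transaction control mode stops the module from issuing implicit `BEGIN`s, so SQLite autocommits each statement unless you open a transaction yourself. The table and function names are made up for illustration.

```python
import sqlite3


def demo_native_transactions():
    # isolation_level=None: the sqlite3 module issues no implicit BEGIN,
    # so SQLite's own autocommit behavior applies.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")  # committed immediately

    states = [conn.in_transaction]  # no open transaction yet

    conn.execute("BEGIN")  # explicit transaction, as in the SQLite docs
    conn.execute("INSERT INTO t VALUES (2)")
    states.append(conn.in_transaction)  # now inside a transaction

    conn.execute("ROLLBACK")  # discards the second insert only
    states.append(conn.in_transaction)

    rows = [r[0] for r in conn.execute("SELECT x FROM t ORDER BY x")]
    conn.close()
    return states, rows


if __name__ == "__main__":
    print(demo_native_transactions())
```

The point is just that every `BEGIN`, `COMMIT` and `ROLLBACK` is one you wrote, which is roughly the model apsw gives you by default.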
The problem I have with this workflow is that the models are still too eager to please. If I ask it to scan a release and note possible issues, it absolutely will find issues. If I keep running the same prompt, it will keep finding issues. I’ve spammed GitHub PR reviews and it just keeps finding (or inventing?) new issues. There is never a “Nothing found, good to go!”. I have to keep reminding myself that the model will always give me what I ask for, regardless of the reality/truth.
> There is never a “Nothing found, good to go!”. I have to keep reminding myself that the model will always give me what I ask for, regardless of the reality/truth.
Tell it something like:
Before doing any commits or producing a summary for the user, you must run a verification sub-agent.
Its goal is to adversarially and critically check your supposed findings to look out for false positives and hallucinations.
Doing so with a separate sub-agent with relatively clean context (but with all the relevant details of the problem space that appear to be facts) should improve our confidence in the findings.
Maybe also something like:
Try to classify each found issue as either SERIOUS, CRITICAL or NITPICK, discard nitpicks, we only care about impactful issues.
It should somewhat cut down on the useless output.
I've largely found the same in regards to generating code - the initial pass will often have bugs that the model itself can find but only when run as a separate sub-agent without the confidence poisoning in its own previous output.
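If the agent returns its findings in a structured form, the "discard nitpicks" and "only keep what the verifier confirmed" steps can also be enforced outside the prompt. A rough sketch, assuming a hypothetical `Finding` record with a `severity` label and a `verified` flag set by the verification sub-agent (neither is part of any real tool's output format):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    title: str
    severity: str   # "CRITICAL", "SERIOUS" or "NITPICK" (labels from the prompt)
    verified: bool  # hypothetical flag: did the verification sub-agent confirm it?


def triage(findings):
    """Keep only verified, impactful findings, most severe first."""
    order = {"CRITICAL": 0, "SERIOUS": 1}
    kept = [f for f in findings if f.verified and f.severity in order]
    return sorted(kept, key=lambda f: order[f.severity])


if __name__ == "__main__":
    sample = [
        Finding("typo in comment", "NITPICK", True),
        Finding("unchecked user input in query", "CRITICAL", True),
        Finding("overflow that cannot happen", "CRITICAL", False),
    ]
    for f in triage(sample):
        print(f.severity, f.title)
```

Doing the filtering in code means a chatty model can't talk its nitpicks back into the summary.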
You didn’t do it enough. They stop finding bugs eventually. Also, different models can find different bugs (though they do find the same ones, too, which is good and expected). For best results you want to run multi-model reviews in loops.
If you had multiple people look at your PRs multiple times on different days results would be very similar.
It'll find a non-existent bug - fix it - figure out it broke a previously working thing - try to fix again - etc.
The "keep improving the code base" prompt has been tried and it never works. The LLM has no sense of where to stop or where to draw the line of reasonableness.
No, depending on the complexity of the issue models can get into loops, where they go "this is definitely an issue and must be fixed", and then the resulting fixed code gets "this is definitely an issue and must be fixed", and then the resulting fixed code has the original 'issue'.
For normal review loops you can ask the model to return with nothing found if nothing is found and not to invent things, and it will do a better job of exiting without anything found.
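The flip-flopping case is at least cheap to detect mechanically: if a "fix" produces code you have already seen, the loop is oscillating and should stop and ask a human. A sketch under the assumption that each review-and-fix pass can be wrapped as a function from code to code (`apply_fix` here is a hypothetical stand-in, not a real agent API):

```python
import hashlib


def _digest(code):
    return hashlib.sha256(code.encode("utf-8")).hexdigest()


def run_fix_loop(code, apply_fix, max_rounds=10):
    """Run review-and-fix passes until nothing changes, a cycle appears,
    or the round budget runs out. Returns (code, status, rounds_used)."""
    seen = {_digest(code)}
    for round_no in range(1, max_rounds + 1):
        new_code = apply_fix(code)
        if new_code == code:
            return code, "converged", round_no    # reviewer changed nothing
        if _digest(new_code) in seen:
            return new_code, "oscillating", round_no  # back to an earlier state
        seen.add(_digest(new_code))
        code = new_code
    return code, "max_rounds", max_rounds


if __name__ == "__main__":
    # A toy "model" that keeps reverting its own fix.
    flip = {"version A": "version B", "version B": "version A"}
    print(run_fix_loop("version A", flip.__getitem__))
```

This only catches exact repeats, so it won't notice a model that keeps rewording the same change, but it does put a hard stop on the A, B, A, B pattern described above.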
I get this sometimes when I ask the agent on GitHub to suggest improvements to my Julia code. It's kind of fun to watch it struggle to please. I'm reminded of the old "Doctor" mode in Emacs.
You need to create a review skill and define in it what "issue" and "good" mean for you, to limit its sensitivity. Otherwise you depend on the model's random threshold, or on none at all, and then you get perfection chasing.
Anyway, it will never match your judgement completely unless you upload a dump of your brain into the model.
Like when you do recursive programming, have you tried providing more/better stop conditions? If you literally just say "Continue until there are no more issues" then it'll do just that, but if you scope it better, like "Only mention issues related to X, Y or that leads to Z" and so on, you'll get less noise and more focus on issues that actually matter (to you).
It also helps to add negative conditions like "do not nitpick", or to name specific bad attractors that you see: "do not investigate/report anything related to symlinks, they are not a concern".
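Putting the scoping advice together, something like the following is one way to wire it up. It is only a sketch: reviewers are modelled as plain callables from prompt to reply, and the exact sentinel string and prompt wording are my own assumptions rather than anything a particular tool requires.

```python
NO_ISSUES = "NO ISSUES FOUND"  # assumed sentinel the prompt asks for


def build_review_prompt(in_scope, out_of_scope):
    """Compose a scoped review prompt with explicit stop conditions."""
    lines = ["Review the change and report issues."]
    lines.append("Only report issues related to: " + ", ".join(in_scope) + ".")
    for topic in out_of_scope:
        lines.append(f"Do not investigate or report anything related to {topic}.")
    lines.append("Do not nitpick.")
    lines.append(f'If nothing in scope is wrong, reply exactly "{NO_ISSUES}".')
    return "\n".join(lines)


def review_until_clean(reviewers, prompt, max_rounds=5):
    """Ask every reviewer each round; stop once all report nothing.
    Returns (rounds_used, issues_from_last_round)."""
    issues = []
    for round_no in range(1, max_rounds + 1):
        replies = [review(prompt) for review in reviewers]
        issues = [r for r in replies if r.strip() != NO_ISSUES]
        if not issues:
            return round_no, []
        # In a real loop the issues would be fixed here before re-reviewing.
    return max_rounds, issues


if __name__ == "__main__":
    prompt = build_review_prompt(["data loss", "auth"], ["symlinks"])
    print(prompt)
```

The `max_rounds` cap matters as much as the prompt: even a well-scoped reviewer should not be the only thing deciding when the loop ends.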
> If I keep running the same prompt, it will keep finding issues.
I've had the same experience, but whenever I've reviewed what it finds it's basically right. It's pedantic, and a lot of the problems aren't things I really care about, but they definitely are real problems.
I'm not sure you can blame the AI for always finding problems if a) you asked it to, and b) there are problems to find.
I use Claude Code and one of the steps in my workflow is do a review loop until no issues are found and it never loops. So my experience is entirely different. Even if I say: fix all issues. So not only the critical issues.
There is a point of diminishing returns though; the issues suggested will get speculative, or point out comment unclarity, or "defense in depth". But I agree it’s somewhat annoying to rarely get clear pushback in terms of "no, this looks good enough to me, release it"
Definitely not. I've never seen a human trapped in that kind of infinite loop. Humans know that if they don't stop at the end of the day, they don't get to go home to their wife, and if they don't finalize their list of issues, they never get their contract paid out.
Pay people per hour of work and, even if there is no actual work, they will definitely find a way of spending hours doing things. If you've worked with contractors/outsourced roles before, you'll have seen this happen from time to time.
I think this was true with older models, but at least with GPT 5.5 it can genuinely tell you "no issues found" after a few passes of finding real issues.
Had this been a corporate environment, the net saving from using one person part-time plus an agent, rather than one person full-time for as long as this would take to implement, would be enough to cover utilities, water and food for an entire village.
It’s silly to act like this was an added cost in a vacuum, or that any costs translate directly into charity for arbitrary families. Also, in some places it would even cover rent for half a day.
The title cost is only if this was raw API usage, but it was included in a subscription, so it's a small subset of the $200 plan:
> I upgraded to the Claude Max $200/month plan (I was previously on $100/month) to increase my Fable allowance for the remaining time until the July 7th Fablepocalypse, when even Claude Max subscribers will have to pay full API cost for the model.
I really wonder if Anthropic will stick with their decision to keep Fable on extra usage credits until they "get more compute", especially in the light of GPT 5.6 very likely coming out next week (it's confirmed to have the exact same pricing as GPT 5.5)
> especially in the light of GPT 5.6 very likely coming out next week
Finally have an explanation why GPT 5.5 xhigh felt dumber and dumber these last few weeks, always the same thing when a new model release is about to come out...
I have never noticed a degradation in either Claude or OpenAI models, and the benchmarks people set up have never shown a statistically significant deviation either: https://marginlab.ai/trackers/claude-code
Yet the same claim is being posted every single day, including new claims that the Fable 5 model has degraded compared to the initial release, guardrails aside.
Almost slipping into conspiracy territory, but without insights into what the labs actually do internally, hard not to:
Anyways, heard about A/B testing before? ML people tend to like it a lot. It's hard to imagine that OpenAI and Anthropic aren't already deep into categorizing people into buckets and running a wild amount of A/B testing all over the place, in various ways, especially in the weeks leading up to new model releases.
Yes, and we can see A/B testing on the ChatGPT website all the time.
They are also testing the new models in their coding tools with select customers first.
People working at OpenAI have publicly denied that they are performing any kind of hidden routing or quantization of models after release for Codex. I tend to believe them.
Glad to see others dual wielding: “I used to think that the idea of having one model review the work of another was somewhat absurd—it felt weirdly superstitious. The problem is it really does work”
Fun fact: because AI written works don't have copyright (in the EU at least) and the level of prompting many people engage in doesn't suffice to create a copyrightable "work" and software licenses require you to actually be able to grant a license using rights you hold on a work, not only are many AI generated "works" not actually protected by copyright but by selling licenses you're actually in breach of contract law and may end up owing the licensee software you don't have.
It’s not perfect but usually it works pretty well, and I’ve had the model come back to me with "oh actually the test passed, the bug doesn’t exist".
As a bonus, you’ve now got a test that can detect that bug if it comes up again.
Typically this means there is some ambiguity in the specification, and the model flips between alternative interpretations.
Would you like it to stop when there are still flaws in the code?
(The fixed prices are just temporary discounts)
So obviously people are going to take their lead and not get legal advice from some greasy dweeb at the bottom of HN.
- Narrator