I was using this and superpowers, but eventually Plan mode became enough, and I prefer to steer Claude Code myself. These frameworks are great for fire-and-forget tasks, especially when there is some research involved, but in my experience they burn 10x more tokens. I was always hitting the Max plan limits for no discernible benefit in the outcomes I was getting. But this will vary a lot depending on how people prefer to work.
I've gone the other way recently, shifting from pure plan mode to superpowers. I was reminded of it due to the announcement of the latest version.
It is perhaps confirmation bias on my part but I've been finding it's doing a better job with similar problems than I was getting with base plan mode. I've been attributing this to its multiple layers of cross checks and self-reviews. Yes, I could do that by hand of course, but I find superpowers is automating what I was already trying to accomplish in this regard.
Yes, it does help in that way. Maybe I'm still struggling to let go and let AI take the wheel from beginning to end but I enjoy the exploratory part of the whole process (investigating possible solutions, trying theories, doing little spikes, etc, all with CC's assistance). When it's time to actually code, I just let it do its own thing mostly unsupervised. I do spend quite a lot of time on spec writing.
That’s part of what I’ve liked about it over plan mode. Again, not a scientific measurement, but I feel it’s better at interactive brainstorming and researching the big picture with me. And its multiple built-in checkpoints also give me more space to pivot or course correct.
Why are we using cli wrappers if you're using Claude Code? I get if you need something like Codex but they released sub agents today so maybe not even that, but it's an unnecessary wrapper for Claude Code.
I've had a good experience with https://github.com/obra/superpowers. At first glance this looks similar. Has anyone tried both who can offer a comparison?
I tried Superpowers for my current project - migrating my blog from Hugo to Astro (with AstroPaper theme). I wrote the main spec in two ways - 1) my usual method of starting with a small list of what I want in the new blog and working with the agent to expand on it, ask questions and so on (aka Collaborative Spec) and 2) asked Superpowers to write the spec and plan. I did both from the working directory of my blog's repo so that the agent has full access to the code and the content.
My findings:
1. The spec created by Superpowers was very detailed (described the specific fonts, color palette), included the exact content of config files, commit messages etc. But it missed a lot of things like analytics, RSS feed etc.
2. Superpowers wrote the spec and plan as two separate documents which was better than the collaborative method, which put both into one document.
3. Superpowers recommended an in-place migration of the blog whereas the collaborative spec suggested a parallel branch so that Hugo and Astro can co-exist until everything is stable.
And a few more differences are written up in [0].
In general, I liked developing the spec through discussion rather than one-shotting it; it let me add things to the spec as I remembered them. It felt like a more iterative discovery process, as opposed to needing to get everything right the first time. That might just be a personal preference though.
At the end of this exercise, I asked Claude to review both specs in detail. It found a few things that both specs missed (SEO, rollback plan etc.) and made a final spec that consolidates everything.
I've used both
From my experience, gsd is a highly overengineered piece of software that unfortunately does not get shit done, burns limits and takes ages while doing so. Quick mode does not really help because it kills the point of gsd: you can't build full software on ad-hocs. I've used plain markdown planning before, but it was limiting and not very stable. Superpowers looks like a good middle ground.
I've tried both. Each has pros and cons. Two things I don't like about superpowers: it writes all the code into the implementation plan at the plan step, and then the subagents basically just rewrite that code back to the files. And I have to ask Claude to create a progress.md file to track progress if I want to work across multiple sessions. GSD pretty much solved these problems for me, but the downside of GSD is that it takes too many turns to get something done.
It's one of those things where having a structure is really helpful - I've used some similar prompt scaffolds, and the difference is very noticeable.
Another great technique is to use one of these structures in a repo, then task your AI with overhauling the framework using best practices for whatever your target project is. It works great for creative writing, humanizing, songwriting, technical/scientific domains, and so on. In conjunction with agents, these are excellent to have.
I think they're going to be a temporary thing - a hack that boosts utility for a few model releases until there's sufficient successful use cases in the training data that models can just do this sort of thing really well without all the extra prompting.
Apart from GSD and superpowers, there's another system, called PAUL [1]. It apparently requires fewer tokens compared to GSD, as it does not use subagents, but keeps all in one session.
A detailed comparison with GSD is part of the repo [2].
Another heavily overengineered AND underengineered abomination. I'm convinced anyone who advocates for these types of tools would find just as much success just prompting claude code normally and taking a little bit to plan first. Such a waste of time to bother with these tools that solve a problem that never existed in the first place.
GSD has a reputation for being a token burner compared to something like Superpowers. Has that changed lately? Always open to revisiting things as they improve.
The whole gsd/agents folder is hilarious. Like a bunch of MD that never breaks. How do you know it is minimally correct? Subjective prose. Sad to see this on the frontpage.
This is the real challenge. The people I know that jump around to new tools have a tough time explaining what they want, and thus how the new tool is better than the last one.
These are incredible new superpowers. The LLMs let us do far, far more than we could before. But it creates an information glut and doesn't come with built-in guards to prevent devolution from setting in. It feels unsurprising but also notable that a third of what folks are suddenly building is harness/prompting/coordination systems, because it's all trying to adapt & figure out process shapes for using these new superpowers well.
There's some VC money interest but I'd classify more than 9 / 10ths of it as good old fashioned wildcat open source interest. Because it's fascinating and amazing, because it helps us direct our attention & steer our works.
And also it's so much more approachable and interesting, now that it's all tmux terminal stuff. It's so much more direct & hackable than, say, wading into vscode extension building, deep in someone else's brambly thicket of APIs, and where the skeleton is already in place anyhow, where you are only grafting little panes onto the experience rather than recasting the experience. The devs suddenly don't need or care for or want that monolithic big UI, and have new soaring freedom to explore something much nearer to them, much more direct, and much more malleable: the terminal.
250K lines in a month — okay, but what does review actually look like at that volume?
I've been poking at security issues in AI-generated repos and it's the same thing: more generation means less review. Not just logic — checking what's in your .env, whether API routes have auth middleware, whether debug endpoints made it to prod.
You can move that fast. But "review" means something different now. Humans make human mistakes. AI writes clean-looking code that ships with hardcoded credentials because some template had them and nobody caught it.
All these frameworks are racing to generate faster. Nobody's solving the verification side at that speed.
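One small piece of the verification side mentioned above, catching hardcoded credentials before they ship, can be sketched as a line scanner. This is a minimal illustration, not a real secret scanner: `SECRET_PATTERNS` and `find_hardcoded_secrets` are hypothetical names, and the two patterns are assumptions chosen for the example.

```python
import re

# Hypothetical patterns for illustration; a real scanner needs a far larger rule set.
SECRET_PATTERNS = [
    # name = "long literal value", for names that look like credentials
    re.compile(r"""(?i)\b(api[_-]?key|secret|password|token)\b\s*[:=]\s*["'][^"']{8,}["']"""),
    # the general shape of an AWS access key ID
    re.compile(r"AKIA[0-9A-Z]{16}"),
]


def find_hardcoded_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that match any secret pattern."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```

A check like this only flags string literals assigned to suspicious names; values read from the environment at runtime pass through untouched.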
Saying "I generated 250k lines" is like saying "I used 2500 gallons of gas". Cool, nice expense, but where did you get? Because if it's three miles, you're just burning money.
250k lines is roughly SQLite or Redis in project size. Do you have SQLite-maintaining money? Did you get as far as Redis did in outcomes?
You can use AI to audit and review. You can put constraints in place so that credentials never hit disk. In my case, AI uses sed to read my env files, so the credentials don't even show up in the chat.
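The comment above does this with sed; the same idea, showing variable names while masking values, can be sketched in Python. This is an assumption about what such a filter might look like, not the commenter's actual command, and `redact_env` is a hypothetical helper.

```python
def redact_env(text: str) -> str:
    """Keep variable names from .env-style text but mask every value."""
    redacted = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            # Blank lines and comments pass through unchanged.
            redacted.append(line)
        elif "=" in stripped:
            # Split on the first "=" only, so values containing "=" stay hidden.
            key, _, _value = stripped.partition("=")
            redacted.append(f"{key.strip()}=<redacted>")
        else:
            # Anything else could be a continuation of a value, so mask it too.
            redacted.append("<redacted>")
    return "\n".join(redacted)
```

The agent then sees which variables exist, which is usually what it needs, without the values ever entering the conversation.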
Things have changed quite a bit. I hope you give GSD a try yourself.
In my experience the issue is that when the same agent writes and reviews its own code, it'll always think it's fine. I've been running a setup where the coder and reviewer are completely separate: different models, and the reviewer doesn't see any of the coder's context, just the spec and final output. Catches way more stuff than I expected, honestly.
Sorry about that. I'm new here and English isn't my first language, so I leaned on tools to help me phrase things and it ended up looking like a bot. Lesson learned: I'll stick to my own words from now on. The point is real though. I've actually been building a multi-agent system and that separation between coder and reviewer is a game changer for catching bugs that look fine on the surface. Anyway, won't happen again.
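The coder/reviewer separation described above comes down to controlling what reaches the reviewer. A minimal sketch, assuming the run is captured in a simple record; `CoderRun` and `build_reviewer_input` are hypothetical names, and the call to the reviewer model itself is left out:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoderRun:
    spec: str             # what was asked for
    final_output: str     # the code or diff the coder produced
    scratch_context: str  # the coder's reasoning, tool logs, chat history


def build_reviewer_input(run: CoderRun) -> str:
    """Assemble the reviewer's input from the spec and final output only."""
    # scratch_context is deliberately never read here, so the reviewer
    # cannot inherit the coder's assumptions or self-assessment.
    return (
        "Review the output strictly against the spec.\n\n"
        f"Spec:\n{run.spec}\n\n"
        f"Output:\n{run.final_output}\n"
    )
```

Sending this text to a different model than the one that wrote the code would complete the setup the comment describes.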
I like openspec, it lets you tune the workflow to your liking and doesn’t get in the way.
I started with all the standard spec flow and as I got more confident and opinionated I simplified it to my liking.
I think the point of any spec-driven framework is that you want to eventually own the workflow yourself, so that you can constrain code generation on your own terms.
I think these types of systems (gsd/superpowers) are way too opinionated.
It's not that they can't or don't work. I just think that the best way to truly stay on top of the crazy pace of changes is to not attach yourself to super opinionated workflows like these.
I'm building an orchestrator library on top of openspec for that reason.
I could not produce useful output from this. It was useful as a rubber duck because it asks good motivating questions during the plan phase, but the actual implementation was lacklustre and not worth the effort. In the end, I just have Claude Opus create plans, and then I have it write them to memory and update it as it goes along and the output is better.
I don't know brother, I don't use them, they may be great they may suck. What I've found is that adding peripherals always creates more problems. If you aren't using Claude for professional work then just sticking with the factory plan mode probably works. If not, look into creating your own Claude skills, try to understand how prompt pipelines work and it will unlock a ton of automation for you. Not just for coding.
I've been using GSD extensively over the past 3 months. I previously used speckit, which I found lacking. GSD consistently gets me 95% of the way there on complex tasks. That's amazing. The last 5% is mostly "manual" testing. We've used GSD to build and launch a SaaS product including an agent-first CMS (whiteboar.it).
It's hard to say why GSD worked so much better for us than other similar frameworks, because the underlying models also improved considerably during the same period. What is clear is that it's a huge productivity boost over vanilla Claude Code.
No idea but doesn’t it sound GREAT and filled with portentous meaning? Don’t be an enterprise clown! Be a gutsy hustle guy like me! Down with enterprise theatre, long live the vibe jam!
Seems fairly obvious: Some agent harnesses play enterprise theater by creating jira-type tickets for you and moving them around silly swim lanes, instead of, of course, just simply getting sh!t done.
I've tried it, and I'm not convinced I got measurably better results than just prompting claude code directly.
It absolutely tore through tokens though. I don't normally hit my session limits, but hit the 5-hour limits in ~30 minutes and my weekly limits by Tuesday with GSD.
It is very hard for me to take seriously any system that is not proven for shipping production code in complex codebases that have been around for a while.
I've been down the "don't read the code" path and I can say it leads nowhere good.
I am perhaps talking my own book here, but I'd like to see more tools that brag about "shipped N real features to production" or "solved Y problem in large-10-year-old-codebase"
I'm not saying that coding agents can't do these things and such tools don't exist, I'm just afraid that counting 100k+ LOC that the author didn't read kind of fuels the "this is all hype-slop" argument rather than helping people discover the ways that coding agents can solve real and valuable problems.
1. Backend unit tests: fast in-memory tests that run the full suite in ~5 seconds on every save.
2. Full end-to-end tests: automated UI tests that spin up a real cloud server, run through the entire user journey (provision → connect → manage → teardown), and verify the app behaves correctly on all supported platforms (phone, tablet, desktop).
3. Screenshot regression tests: every E2E run captures named screenshots and diffs them against saved baselines. Any unintended UI change gets caught automatically.
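The screenshot regression step in the list above can be sketched as a comparison of named images against stored baselines. This is a simplified illustration, assuming screenshots are available as raw bytes keyed by name. `diff_screenshots` is a hypothetical helper, and it compares byte-for-byte, whereas real tools typically allow a pixel tolerance.

```python
import hashlib


def diff_screenshots(
    baselines: dict[str, bytes], current: dict[str, bytes]
) -> dict[str, str]:
    """Return name -> status for every screenshot that is not identical to its baseline."""
    report = {}
    for name in sorted(baselines.keys() | current.keys()):
        if name not in current:
            report[name] = "missing"   # baseline exists, this run did not capture it
        elif name not in baselines:
            report[name] = "new"       # captured this run, no baseline yet
        elif (
            hashlib.sha256(baselines[name]).digest()
            != hashlib.sha256(current[name]).digest()
        ):
            report[name] = "changed"   # bytes differ from the saved baseline
    return report
```

An empty report means the run matched every baseline; anything else is a candidate for review or for updating the baseline.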
I was not an app developer before, but a systems engineer with devops experience. But I learnt a lot about Apple development and App Store Connect, and essentially became an app developer in a month. I don't think I could learn so quickly with other humans' help.
You might be surprised. In 2008, when the App Store first came out, I became an iPhone app developer after reading one book. I already knew C, so Objective-C wasn't a big leap.
Between my own apps and consulting work, I had a pretty good side business. Like everything else though, those days didn't last forever. But there was a lot of easy money early on.
A self-hosted VPN server manager: a TypeScript/Hono backend that runs on your own VPS, paired with a SwiftUI iOS/macOS app. It lets you provision cloud servers across multiple providers (Hetzner, DigitalOcean, Vultr), manage them via a Tailscale-secured connection with TLS pinning, and control an OpenClaw gateway.
I will open source it in a few weeks, as I still have to complete a few more features.
It's important to build a local dev environment that GSD can iterate on. Once I have done that, I just discuss with GSD and a few hours later features land.
The README recommends --dangerously-skip-permissions as the intended workflow. Looking at gsd-executor.md you can see why — subagents run node gsd-tools.cjs, git checkout -b, eslint, test runners, all generated dynamically by the planner. Approving each one kills autonomous mode.
There is a gsd-plan-checker that runs before execution, but it only verifies logical completeness — requirement coverage, dependency graphs, context budget. It never looks at what commands will actually run. So if the planner generates something destructive, the plan-checker won't catch it because that's not what it checks for. The gsd-verifier runs after execution, checking whether the goal was achieved, not whether anything bad happened along the way. In /gsd:autonomous this chains across all remaining phases unattended.
The granular permissions fallback in the README only covers safe reads and git ops — but the executor needs way more than that to actually function. Feels like there should be a permission profile scoped to what GSD actually needs without going full skip.
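A scoped permission profile of the kind suggested above could start as a pre-execution check on each generated command. The sketch below is purely illustrative and not part of GSD: `ALLOWED_PROGRAMS`, `DENIED_GIT_ARGS`, and `check_command` are hypothetical, with the allowlist loosely based on the commands named in the comment (`npm` stands in for a test runner).

```python
import shlex

# Hypothetical allowlist; adjust to what the executor actually needs.
ALLOWED_PROGRAMS = {"node", "git", "eslint", "npm"}
# Hypothetical git arguments treated as destructive.
DENIED_GIT_ARGS = {"--force", "--hard", "clean"}
# Reject chaining and redirection outright rather than trying to vet each part.
SHELL_METACHARACTERS = (";", "|", "&", ">", "<", "`", "$(")


def check_command(command: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a single generated shell command."""
    if any(meta in command for meta in SHELL_METACHARACTERS):
        return False, "shell metacharacter"
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False, "unparseable"
    if not tokens:
        return False, "empty"
    program = tokens[0]
    if program not in ALLOWED_PROGRAMS:
        return False, f"program not allowed: {program}"
    if program == "git" and any(arg in DENIED_GIT_ARGS for arg in tokens[1:]):
        return False, "destructive git argument"
    return True, "ok"
```

A check like this looks at what will actually run, which is the gap the comment points out between the plan-checker and the verifier. It is deliberately conservative: anything it cannot parse or does not recognise is refused.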
[0] https://annjose.com/redesign/#two-specs-one-project
These are fun to use.
Superpowers and gsd are Claude Code plugins (providing skills).
Get Shit Done is best when you're an influencer and need to create a Potemkin SaaS overnight for tomorrow's TikTok posts.
[1] https://github.com/ChristopherKahler/paul
[2] https://github.com/ChristopherKahler/paul/blob/main/PAUL-VS-...
There are so many different forms of this happening all at once. Totally different topic, but still in the same broad area, submitted just now too: Horizon, an infinite canvas for terminals/AI work. https://github.com/peters/horizon https://news.ycombinator.com/item?id=47416227
But I guess if I go by what you’re saying I suppose it makes sense for it not to do a bunch of things you didn’t ask it to do.
I got a promotion once for deleting 250K lines of code in less than a month. Now that sounds better.
Faster than using ai. Cheaper. Code is better tested/more secure. I can learn/build with other humans.