I take a peek every month or so at spend for my company and notice more and more people are consuming $1k in tokens a month, and it is bewildering to me how. I use LLMs daily, and see anywhere from $200-$400 tops. This is using the most expensive models, in deep thinking mode. So I'm not a Luddite against the usage of them. I just can't figure out _how_ to burn that much money a month responsibly.
I genuinely challenge someone spending $5-$10k a month to demonstrate how that turns into $50-$100k in value. At a corporate level, I'd much rather hire a junior engineer who spends $100-$200/month and becomes productive than try and rationalize $100k/year in token spend.
First: There's the obvious "If the company is letting me do it, I'll be wasteful." This includes not clearing/compacting the context often. Opus now has a 1M context window, and quality is good to at least 200K. So each query is burning a lot of tokens until you clear/compact.
People have already mentioned the size/complexity of the codebase. I'm new to my team and the codebase isn't huge, but it's large enough that there are plenty of parts I have little understanding about. When I'm given a task, then yes, I definitely go to Claude and ask it to find the relevant parts of code so I can understand the existing workflow before even attempting to change it.
The downside is that I don't build expertise. But the reality is that with Claude, I can get the work done in 1 day that would take me 5 days of struggling, and if everyone is doing it, I can't be left behind. So I take the middle route - I get it done in 2-3 days instead of 1 so I can at least spend some time with the code.
Especially with AI, the rate at which code changes in our codebase is insane. So I built a tool that takes a pull request, and tells the LLM to go deep and explain to me what that pull request does. (Note: I'm not the reviewer, I just want to keep tabs on the work that is going on in the team).
And this is just the beginning. I haven't actually spent time to come up with more ways to use the LLM to help me.
My usage is similar to yours, but if I were fairly experienced with the code base, I'd do a lot more. I haven't asked, but I suspect there are people in my team who go over $1K/month.
As always, the bottleneck is proper testing and reviews.
Edit: I'll also add that for not-so-important code used within the company, I suspect most people are going full-AI with it. For my personal (non-work) code, I just let the AI code it all - the risk is usually very low (and problems are caught quickly). If someone is using the "superpowers" skill, then even for basic features you can burn lots of tokens. I usually start with 20-40K tokens and end up with 80-90K tokens when it's finished. Which means that many of the requests prior to completion were sending in close to 80K tokens. Multiply that with the number of queries, etc.
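To put rough numbers on that last point, here is a minimal sketch. It assumes the context grows linearly from its starting size to its final size and that every request resends the whole context (no caching). The function name and the 40-request session are made up for illustration.

```python
def total_input_tokens(start_tokens: int, end_tokens: int, num_requests: int) -> int:
    """Total input tokens sent over a session.

    Assumes linear context growth, so the average request carries
    (start + end) / 2 tokens and the total is that average times the
    number of requests.
    """
    if num_requests < 1:
        raise ValueError("num_requests must be at least 1")
    return num_requests * (start_tokens + end_tokens) // 2


# Hypothetical session: starts at 30K tokens, ends at 85K, 40 requests.
session_total = total_input_tokens(30_000, 85_000, 40)
print(f"{session_total:,} input tokens for one feature")
```

Under those assumptions a single "basic feature" sends a couple of million input tokens, even though the context window never went above 90K.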
> This includes not clearing/compacting the context often. Opus now has a 1M context window, and quality is good to at least 200K. So each query is burning a lot of tokens until you clear/compact.
I see this repeated by others, including coworkers. It completely ignores caching. Caching itself is complicated, but the "longer context window = more expensive" is not 100% true and you are hampering yourself if you're not taking full advantage of large context windows.
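A rough sketch of why caching changes the math. Everything here is an assumption for illustration: the price per million tokens, the 0.1 multiplier for cached reads, and the simplification that the whole previous context is a cache hit. It ignores cache-write surcharges and cache expiry, which is part of why caching is complicated in practice.

```python
def session_cost(context_sizes, price_per_mtok, cached_read_multiplier=None):
    """Input cost in dollars for a session with a growing context.

    context_sizes: tokens sent on each request, in order.
    cached_read_multiplier: if given, tokens already sent on the previous
    request are billed at this fraction of the base price; only the newly
    added tokens are billed at full price.
    """
    cost = 0.0
    previous = 0
    for size in context_sizes:
        if cached_read_multiplier is None:
            billable = size
        else:
            cached = min(previous, size)
            billable = cached * cached_read_multiplier + (size - cached)
        cost += billable * price_per_mtok / 1_000_000
        previous = size
    return cost


# Hypothetical numbers: three requests, $10 per million input tokens.
sizes = [100_000, 110_000, 120_000]
uncached = session_cost(sizes, price_per_mtok=10.0)
cached = session_cost(sizes, price_per_mtok=10.0, cached_read_multiplier=0.1)
print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

With these made-up numbers the long context still costs something on every request, but far less than the naive "tokens in context times number of requests" estimate suggests.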
I have anecdotal examples of Claude Code choosing a solution to a problem that is ridiculously token inefficient.
One example: I was giving several agents different sub-problems to solve in a complex ML / forecasting problem. Each agent would write + run + read a Jupyter notebook. This worked OK, the notebooks would be verbose but it was fine... until one of them wrote out hundreds of thousands of rows to a cell output, creating a 500MB ipynb file. Claude tried several times to read it and it used my entire context limit.
The solution was to prescribe a better structure for doing the work (via CLI analysis scripts + folders to save research results to). But this required some planning, thought, and design work by me, the operator.
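A minimal sketch of that kind of structure, not the commenter's actual scripts: the analysis writes the full result to a file in a results folder and prints only a short summary, so an agent reading the output never sees the hundreds of thousands of rows. The function name, folder layout, and preview size are all assumptions.

```python
import csv
import tempfile
from pathlib import Path


def save_result(name: str, header: list[str], rows: list[list],
                out_dir: Path, preview_rows: int = 5) -> str:
    """Write all rows to out_dir/<name>.csv and return a short summary."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.csv"
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    # Only the row count, the header, and a small preview go to stdout.
    lines = [f"wrote {len(rows)} rows to {path}", ",".join(header)]
    lines += [",".join(str(value) for value in row) for row in rows[:preview_rows]]
    return "\n".join(lines)


# Toy usage: 100,000 rows on disk, seven lines for the agent to read.
with tempfile.TemporaryDirectory() as tmp:
    rows = [[i, i * i] for i in range(100_000)]
    summary = save_result("squares", ["n", "n_squared"], rows, Path(tmp) / "results")
    file_exists = (Path(tmp) / "results" / "squares.csv").exists()

print(summary)
```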
When I see people spending $10k a month in tokens, I can only assume they are taking lazy, hands-off approaches to solving problems with the expensive hammer that is Claude Code. Example: have Claude read all your emails every day... the lazy solution is to simply do that, but a smarter solution is to first filter the email body HTML to remove the noise.
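For the email example, a minimal sketch of that pre-filter using only Python's standard library. Which tags count as "noise" is an assumption here (scripts, styles, and the head section); real marketing email would likely need more rules.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips script, style, and head content."""

    SKIP = {"script", "style", "head"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.parts.append(text)


def email_html_to_text(html: str) -> str:
    """Return the visible text of an HTML email body, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>Hi Bob,</p><script>track()</script><p>Invoice attached.</p>"
    "</body></html>"
)
print(email_html_to_text(sample))
```

The text that comes out is a fraction of the size of the raw HTML, which is the whole point: fewer tokens sent per email for the same information.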
If it’s very large, especially if the tool needs to refer to documentation for a lot of custom frameworks and APIs, you often end up needing very large context windows that burn through tokens faster.
If it’s smaller or sticks with common frameworks that the model was trained on, it’s able to do a lot more with smaller context windows and token usage is way lower.
The codebase and the topic you're working on are huge variables.
I don't use LLMs to write code (other than simple refactors and throwaway stuff) but I do use them heavily to crawl through big codebases and identify which files and functions I need to understand.
Some of the codebases I explore will burn through tokens at a rapid rate because there is so much complex code to get through. If I use the $20 Claude plan and Opus I can sometimes go through my entire 5-hour allocation in a single prompt exploring the codebase, and it's justified.
Other times I'm working on simple topics, even in a large codebase, and it will sip tokens because it only needs to walk a couple files to get to what it needs to answer my questions.
I'm currently in repos where the context window required is so large that the output is almost always "wrong" for the problem at hand. Quite a few people at my company burn through tokens this way, and it certainly isn't providing value to the company.
As always, improving accessibility for humans makes automation more effective. If the humans need to remember a PhD's worth of source code/documentation to contribute effectively, your codebase stinks.
People at my company have started writing docs specifically for claude. They're quite useful for me too, but kinda disappointing they never wrote these docs for their colleagues.
I recently saw this with the logseq api - the published api was an auto-generated stub. So I tried to grep the source code for the function and found detailed documentation written for claude. So I guess one benefit of all of this is that it's making people actually document things and maybe plan a little bit before implementing.
As someone who has written many docs, it's because 99% won't read them (rightfully so if they're verbose). You can turn that doc into a skill in a repo and Claude will read it every time it's needed.
The LLM hype train has me reflecting on what a spoiled existence working in a ‘proper’ language provides though…
React devs, JS devs, front-end devs working on large sites and frameworks might be triggering tens of files to be brought into context. What an OCaml dev can bring in through a 5 line union type can look very different in less token-efficient and terse languages.
Begs the question of whether we should move to minimal microservices so that the whole project lives in the LLM's context. I hardly have to do anything when I'm working on a small project with an LLM.
Why not take it a step further? Make each function in the codebase its own project. Then the codebase can fit into the context window easily. All you have to do is debug issues between functions calling each other.
Generally speaking no. Treat your IP (the code that runs your business, makes your business competitive or special) as precious and don't make it subservient to infra. It should be in the format (code, architecture, structure) that best serves it.
Like I said, if your work is already contained neatly inside one microservice then it doesn't matter.
The same would be true in a monolith: The context to understand what's happening would be contained to a few files.
When the work starts crossing through domains and potentially requiring insight into how other pieces work, fail, scale, etc. then the microservice model blows up complexity faster than anything, even if you have the API documented.
I've done the opposite, moving multiple tightly coupled repos into a single monorepo. Saves the step of the llm realizing there's a bigger context, finding the repo, then also scanning/searching it. Especially for fixes that are simply one line each in two repos.
Unquestionably good. They want a product that provides value anywhere it's tried so as to establish the reputation as a magic human replacement. Gaming consumption based pricing at this point would be quitting before the race is over. They can always tweak the pricing knobs later once the industry is fully hooked.
One thing that stands out is that it sounds like you're using LLMs for only one part of your process. You're having LLMs help you write code, but the code you're writing doesn't itself make use of LLMs.
My current job basically involves trying to improve processes that themselves make heavy use of LLMs. Once you have multiple agents in parallel running multiple experiments on improving the performance of primarily LLM driven tools it's not that hard to get your token usage pretty high.
In our org it's people that have too much stuff in their context, every mcp in the world installed, GTD, PAI, OpenClaw. I'm equally baffled how one can spend that much money during their day to day.
I'm on the same page. Do people not analyze the problems themselves? Are they just copy/pasting their entire ticket description into Claude Code and having it iterate until they land on something that works?
That's my take as well. I've had my unPRed branches grabbed up and blindly merged by an agent twice now. The guy doing it was shocked both times that his PR had my change sets in it.
Also one engineer is treating the code as assembly. I've asked some pointed questions about code in his PR and the response was "yeah, I don't know that's what the agent did".
Edit:
To everyone freaking out about the second guy: yeah, I think being unable to answer questions about the code you're PRing is ill-advised. But requirement gathering, codebase untangling, and acceptance testing are all nontrivial tasks that surround code gen. I'm a bit surprised that having random change sets slurped up into someone else's rubber-stamped PR isn't the thing that people are put off by.
My friend is a CTO at a non-tech company and he's now dealing with code from non-SWEs trying to self serve with LLMs.
But it's like a kid running a lemonade stand. Total DIY weekend project quality stuff that they are demanding go live. Hardcoded credentials, no concept of dev/qa/prod environments, no logging, no tests, no source control.
I'm not really sure teaching basic SWE practices / SDLC / system design to people whose day job is like.. accounting makes sense compared to just accelerating developer productivity.
No, you should have forward deployed engineers sitting and working right beside these traditional non SW roles if you need to fully integrate AI into their mix.
Right, unfortunately a lot of orgs are quickly letting loose some combination of non-tech self-serve AI coding and tech org staffing reductions rather than ADDING forward deployed engineers.
It’s the same dilemma as of old: it’s easier to teach a doctor UML than a coder doctoring. But, critically, that’s about making doctor-facing IT systems, not performing their skilled jobs.
Bringing code does not help, but a validated user story with flow diagrams, a UI suggestion, and a valid ticket could. That’s how you bridge the gap.
Were I that CTO I’d explain that code carries liability: SWEs can end up in jail for malfeasance, and fines, penalties, and lawsuits are what await us for eff-ups. “Coders” get fired if their code doesn’t work. Same speech to the devs: do exactly as much unsolicited accounting as you wanna get fired for. Talk fences, good neighbours.
The ROI on teaching UML to a doctor is pretty low though right?
Non-technical people are not writing tickets, they are just slinging slop.
Another anecdote of things I've seen: a non-technical person setting up some web scraping monstrosity with 200k lines of code. They beat their chest about how they didn't need the IT org. One month goes by and of course it breaks as soon as anything on the website changes, and now they have a gun to IT's head to "fix it" and take it over.
This outcome for a DIY brittle web scraper is obvious to anyone that's ever written code, but shocking to someone who thinks LLMs are magic.
Because he's paid to deliver code that works. Letting an AI agent do everything would be fine if it didn't make any mistakes, but that's far from reality.
Do typesetters inexplicably change the meaning of the book or document being typeset? Do compilers alter the behavior intended by the programmer, sometimes in ways that are not immediately obvious? Did the invention of typesetters lead to investments so massive, that the investors had to herald the end of handwriting (no equivalent analogy for compilers)?
Resistance to technological change has been a thing since farming was invented. Socrates thought that writing will ruin everyone's memory, and that people who just rely on written word will appear knowledgeable while actually knowing nothing.
The only difference is that this is happening to us.
I take the prompts to the AI so the manager doesn't have to! I have prompting skills!!
I just can't make the joke work. There really are people that think they can get paid to press the agent's on button. How long before their checks stop clearing and it "just works itself out naturally"?
That's literally how some Meta AI jobs looked a few years back - set up a few parameters, push a button, wait until training and evals are finished; repeat if needed. $500k+/year.
It's bizarre to me that people being paid to use their brains, with a job title including the word "engineer", which essentially means "clever thought thinker" in Latin, are just offloading all of their thinking to a bot instead of just using it as a way to ensure clean execution and faster understanding of the structures of underdocumented projects.
There's some people who are offloading all of their thinking to a bot, and I agree with you that I don't really understand this. But the good version of it is to offload some of your thinking to a bot so you can focus your own thinking on the parts that matter. My time is much better spent on "ah there is a scalability tradeoff here" than "I guess I have to initialize the FooBarProviderServiceProvider in a different spot so that I can pass a mock to the FooBarProvisionConsumer unit tests".
And why wouldn't they? Companies are quite literally instructing them to do so. I work at such a company and have heard similar anecdotes from colleagues that work at other companies.
To be fair, taking an average SWE at $160k/y, and spending $1k/m, and offloading mechanical ticket work from their working set sounds like a bargain to me. They could be spending the time on design and planning and working on new things, figuring out how to save costs in optimizations. In fact, the more soul-sucking mechanical tasks you offload, the better off you are overall.
It’s not like AI is the first time this happened. CI/CD and extensive preflight and integration and canary testing is also a way of saving engineer time and improving throughput at the cost of latency and compute resources. This is just moving up the semantic stack.
Obviously as engineers we say “awesome more features and products!” but management says “awesome fewer engineers!” either way pasting the ticket in and letting a machine do the work for a fraction of the cost was the right choice. There’s no John Henry award.
> pasting the ticket in and letting a machine do the work for a fraction of the cost was the right choice
If it were producing equivalent outcomes, sure. So far I haven't personally seen strong evidence for that. LLMs do write code pretty competently at this point, but actually solving the correct problem, and without introducing unintended consequences, is a different matter entirely.
This. LLMs are terrible at planning/architecture and maintaining clarity of vision across a project. There are lots of tools that mitigate these issues but they're going to keep coming up regardless because of the fundamental nature of LLMs.
If you're not doing the design of the solutions for problems as an engineer or at least making the decisions and owning the maintenance of that architecture/design, what even is your job at that point?
> and offloading mechanical ticket work from their working set sounds like a bargain to me
Unfortunately the people who offload the work of understanding and interacting with tickets just end up offloading the consequences to everyone else who has to do extra work to make sure their LLM understands the task, review the work to make sure they built the right thing, and on and on.
The same thing happens when people start sending AI bots to attend meetings: The person freed up their own time, but now everyone else has to work hard to make sure their AI bot gets the right message to them and follow up to make sure what was supposed to happen in the meeting gets to them.
If someone sends a bot to a meeting, warn them the first time. Fire them the second, for exactly the reason that you said in your last paragraph: They're pushing their work onto other people.
Not what I do. I'll reformulate the ticket description so that the purpose and as many details as possible about the solution are made clear from the start. Then I tell Opus to go and research the relevant parts of the codebase and what needs to be done, and write its findings to a research.md file. Then I'll review that file, bring answers to any open questions and hash out more details if any parts seem fuzzy. When the research is sound I'll ask Opus to produce a plan.md document that lists all the changes that need to be made as actionable steps (possibly broken into phases). Then I'll let Sonnet execute the steps one by one and quickly review the changes as we go along.
> Are they just copy/pasting their entire ticket description into Claude Code and having it iterate until they land on something that works?
"Their ticket" = that was AI generated.
After which they will wait for their AI-generated PR to be checked by an automated AI QA that will validate it against the AI-generated spec.
It feels like an important metric of "corporate AI adoption" should be how effective the human is at steering the AI.
IF THE HUMAN ISN'T EFFECTIVE, THE HUMAN NEEDS TO GO.
If it manages to produce a working solution, then it's great! Why would you waste your time on it?
If it fails, then it's great too! You find your value by solving the ticket, which can be a great example of where a human can still prevail over the AI (joke: AI companies might be interested in buying such examples).
(All assuming that your time cost is pricier than token spending. Totally different story if your wage is less than token cost)
There’s your problem. You’re trying to be responsible instead of trying to burn tokens so you can have your name on top of some leaderboard for most wasteful AI users.
The leaderboards are dumb, but I understand the point of telling people not to worry about tokens and just use it. They are trying to get people to try it, to discover new uses without asking “is this worth testing”. It’s basically early R&D budget. Eventually these companies will decide it’s time to transition into efficient usage.
The fully loaded cost of a senior engineer is already well past 400k. +5k a month is not that much if it helps them be XX% more productive.
Personally at a different big tech I'm in the mid 4 digits AI spend per month and it helps me a lot, basically all coding has been trivialized and I work on an extremely large codebase. I'm spending more time on things closer to direct value generation like data analysis and experiment tweaking rather than spending time moving a variable across 10 layers of abstraction and making sure code compiles.
> I just can't figure out _how_ to burn that much money a month responsibly.
Same but in regards to quotas. I'm on the 200 EUR ChatGPT plan, so presumably have the highest quota, using the "most expensive" models, on highest reasoning, in fast-mode (1.5x quota usage) and after a full day of almost exclusively doing programming with agents, I still get nowhere close to hitting my quota.
In fact, since I started using agents for coding, the only time I even got close, was when I was doing cross-platform development with the same as above, but on three computers at the same time, then I almost hit my weekly quota. But normally, I get down to ~20% of the quota but almost never below that. I don't see how I could either, I'm already doing lots of prompts and queries "for fun" basically.
There are tools that let you extract out what the API price would be for a subscription plan use. I typically have monthly runs that are on the order of $2k - $4k at API prices, despite paying a mere $200/mo to Anthropic.
Edit: Just checked with ccusage and I've been doing about $450/day for the last week. A bit more than usual, but I still haven't come close to weekly limits and never hit the 5hr rate limit.
> Same but in regards to quotas. I'm on the 200 EUR ChatGPT plan,
The API rates and monthly plan rates are not the same.
If you're using enough to justify the 200EUR plan (instead of the 100EUR plan), your use might actually be as high as some of the API bills discussed above.
Codex quota is suspiciously high right now. Either way, the subscription plans are not sustainable, and perhaps less relevant to any discussion about corporate API use. The prosumer developer plans are an insane deal. It is a golden age right now and it will end. If you tried to use the APIs to achieve the same thing, you would be spending thousands upon thousands of dollars a month. My completely unfounded conjecture is that OpenAI is trying to grab developers back from Claude by burning $$$$.
> If you tried to use the APIs to achieve the same thing, you would be spending thousands upon thousands of dollars a month.
Yeah, obviously, not sure why anyone would be using APIs at this point, seems bananas to spend more than 10 EUR per day when these "almost-endless" subscriptions exist.
> My completely unfounded conjecture is that OpenAI is trying to grab developers back from Claude by burning $$$$.
Unlikely. Since the codex TUI was launched, OpenAI has pretty much had every developer in its pocket already, as the agent is miles and leagues ahead of Claude Code, pretty much from inception. No other provider comes close to ChatGPT's Pro Mode either. I don't even think it's a quota/pricing thing: have the best models and people will flock by themselves.
> miles and leagues ahead of Claude Code, pretty much from inception.
Can codex run background tasks yet? CC's ability to run a process in the background and monitor its output for errors while another process accesses that first process is probably what got CC so popular for web development over codex to start with.
I am running a bunch of autoresearch loops that optimize various compilers and it's pretty easy to burn through as much money as you want if you have a measurable goal and good tests.
I have both of those, yet seemingly I'm not setting my goals in such a way that they support "endless inference" like that. My goals eventually have ends, and that's when I move on. Optimization sure sounds like something you can throw a good amount of tokens/quota at, so yeah.
I don't use automated agent workflows or anything, I just use Claude as a pair programmer of sorts. A month or so ago I used Claude Opus 4.6 for 2-4 hours on API pricing and racked up $20 in spend, which surprised me since that was much higher than my usual.
I don't know about $10,000, but I can see hitting $1,000 pretty easily if you aren't looking at the costs.
The answer may be agentic loops that keep cycling through the same problem again and again until they land on a non-erroneous outcome. Some people boast of having multiple such agents working in parallel on different problems, tending to one while another is processing, perhaps not unlike the movie mad scientist who runs around the lab throwing switches while laughing maniacally at the prospect of his impending success.
Claude is a mediocre programmer that can do great things with great supervision, but it can't make mediocre human programmers into good ones, because they can't provide great supervision.
I'd bet it's the LLM doom loop: vaguely ask it to do something, tab to news.ycombinator.com for 30 minutes, tab back, notice it misunderstood the prompt. Restart with new improved prompt, tab back to HN.
So yeah, probably the same thing people do anyway, except instead of compile time it's now generating time.
> I just can't figure out _how_ to burn that much money a month responsibly.
I always have a few agents (2-5) doing research and working on plans in parallel. A plan is a thorough and unambiguous document describing the process to implement some feature. It contains goals, non-goals, data models, access patterns, explicit semantics, migrations, phasing, requirements, acceptance criteria, phased and final. Plans often require speculative work to formulate. Plans take hours to days to a couple of weeks to write. Humans may review the plans or derived RFCs. Chiefly AI reviews the code (multiple agents with differing prompts until a fixed point is reached between them). Tests and formal methods are meant to do heavy lifting.
In my highest volume weeks, I ship low hundreds of thousands of lines of software not counting changes to deps.
> At a corporate level, I'd much rather hire a junior engineer
Any formulation of a problem sufficient for a truly junior engineer to execute is better given to an agent. The solution is cheaper, faster, and likely better. If the latter doesn't hold, 10 independent solutions are still cheaper and faster than a junior engineer.
There is no longer any likely path to teaching a junior engineer the trade.
I am sorry, I am probably just very dumb, but this sounds extremely wasteful. If this is a reflection of how software was made before AI I wonder how anything was ever made.
I don't know about the GP, but my workflow is similar to theirs, but I aim to ship low thousands of lines per week. The fewer the better. I even tell the agent to only write high SNR tests, otherwise it just adds useless "make sure this function returns this thing we hardcoded".
I usually succeed, BTW. I spend a lot of time planning, but usually each PR is a few hundred lines, and fairly easily reviewable.
I mostly work with Python backends, though these days it might be any language (Ruby, Go, TS).
I dunno, I've seen agents make boneheaded mistakes even a junior engineer wouldn't make. Treating them as strictly better than junior engineers is a problem, not just for that reason but because you're effectively killing the pipeline for senior engineers. Then what?
Several options for how to burn that amount of money without specifically looking to tokenmaxx:
- Agents that spawn other agents
- Telling agents to go look at the entire codebase or at a lot of documents constantly
- MCP/API use with a lot of noise
- Loops where the agent is running unattended.
I do think it's not really responsible use and a loop where the agent is trying to fix CI for one hour for something that would take you five minutes (for example) is absurd. But people do that.
One of the new dynamics is a loop between a "code review" LLM and a "fix" LLM. It's super annoying because the code review LLM often finds more bugs on a follow-up review that were there from the beginning, but at least I can loop both until checks go green.
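A minimal sketch of that review/fix loop with a hard cap on rounds, so an unattended loop cannot run (and bill) forever. The `review` and `fix` callables stand in for the two LLM calls and are purely hypothetical; here they are replaced by toy functions so the example runs.

```python
from typing import Callable


def review_fix_loop(code: str,
                    review: Callable[[str], list],
                    fix: Callable[[str, list], str],
                    max_rounds: int = 5):
    """Alternate review and fix until the review is clean or the cap is hit.

    Returns (final_code, rounds_used, is_clean).
    """
    rounds = 0
    findings = review(code)
    while findings and rounds < max_rounds:
        code = fix(code, findings)
        rounds += 1
        findings = review(code)
    return code, rounds, not findings


# Toy stand-ins: the "reviewer" flags every BUG marker, the "fixer"
# only manages to repair one per round.
def toy_review(code):
    return ["bug"] * code.count("BUG")


def toy_fix(code, findings):
    return code.replace("BUG", "ok", 1)


print(review_fix_loop("BUG BUG BUG", toy_review, toy_fix))
print(review_fix_loop("BUG BUG BUG", toy_review, toy_fix, max_rounds=2))
```

The cap is the design choice that matters: each extra round resends the code and findings, so the number of rounds is a direct multiplier on spend.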
Do lots of deep research and code reviews on large legacy codebases. I've created lots of documentation to reduce token consumption but it's still a lot of token consumption.
I spend 400-500 dollars per day during active development at this point. However with more aggressive task breakdowns I can spend ~5k per day.
These spend rates are in part due to operating on a larger code base. Operating on a larger code base means more time searching and understanding the code, tests, test output. They are also due to going all-in on agentic coding.
It can feel painfully slow to go back to coding by hand when for a dollar you can build the same functionality in a minute. Now do this with multiple sessions and you can see where the cost goes.
The problem with HN is that everyone here thinks like an engineer, not like a business owner.
$10k a month on tokens is just not that much when you're already making $2M per engineer. If their productivity has increased even 10% then the spend was well worth it.
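A back-of-the-envelope check of that claim, using only the figures in the comment ($10k/month in tokens, $2M per engineer, a 10% uplift). The function name is made up, and treating "productivity" as a straight multiplier on revenue per engineer is a simplifying assumption.

```python
def break_even_uplift(monthly_token_spend: float,
                      annual_value_per_engineer: float) -> float:
    """Fractional productivity gain needed for token spend to pay for itself."""
    return (monthly_token_spend * 12) / annual_value_per_engineer


annual_spend = 10_000 * 12
uplift_needed = break_even_uplift(10_000, 2_000_000)
extra_value_at_10_percent = 0.10 * 2_000_000

print(f"annual token spend: ${annual_spend:,}")
print(f"break-even uplift: {uplift_needed:.1%}")
print(f"value of a 10% uplift: ${extra_value_at_10_percent:,.0f}")
```

Under those assumptions the spend breaks even at a single-digit percentage uplift. The hard part, as the replies point out, is showing the uplift is real.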
Case in point, Meta made 33% more revenue this earnings report. Now you can nitpick and ask for attribution down to the dollar, but macro trends speak for themselves.
Go look up a multi-year chart of their revenue and find the inflection point where the AI made it go up faster (there isn't). In fact revenue growth used to be higher pre-2023.
I don't think it's about value. Tokenmaxxing is a thing now since that one CEO said he wants his $250k/yr devs to use $400-$500k/yr in tokens, so now it's all about how many agents can you have running concurrent tasks all day long.
It turns out writing good prompts helps to keep token usage down as the model wastes fewer tokens discovering context it needs that wasn't hinted at in the prompt.
Whereas a good prompt will give solid leads to all the specifics needed to complete the task.
That would be true in a sane world with investors who value profitability. But everything is now focused on DAU and the network effect. Overusing their services might actually make them look better to investors who shovel more money to them to light on fire.
In addition to what folks are saying here about larger code bases and multiple features at once, there’s also the time requirement to be efficient. It takes time to be more efficient with token usage and it may not be worth it for some of these companies so… burn away until we start to get more data and then we’ll check in.
On the OpenAI side, GPT-5.5 generates spend at a prolific rate that's even faster if you use it through an ACP connection in a tool like Zed. I used to never think about Codex rate limits and now I'm hitting mine every 5-hour block and spending ~$100/day on top of that in ad hoc credit purchases.
There was a tool posted called codeburn that showed a breakdown of what activity your usage was spent on. Mine was almost all coding but other people in the thread said >50% of their usage was conversation. I’m inclined to agree with you that someone who is reasonable with their compute usage is likely to be thinking things through rather than just burning tokens to get an LLM to solve the problem.
You are probably guiding them step by step and reading the results. Maybe you also sit and wait for the results.
Agents can iterate on a problem for hours if they can see their results and be given a higher level goal to evaluate their progress toward.
When you have an agent working for minutes or hours, never wait on it. Use that time to spin up another agent.
You can also spin up several agents in parallel to attempt the same item of work and compare their results to choose which to work off for next steps, instead of rolling the dice on a single option at a time and gambling that it's better to refine that first attempt instead of retrying from the start several more times.
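The "several attempts in parallel, keep the best" idea can be sketched like this. The `attempt` and `score` callables are hypothetical stand-ins for "run an agent on the task" and "evaluate its result" (tests passed, reviewer rating, and so on); toy functions are used so the example runs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def best_of_n(attempt: Callable[[int], str],
              score: Callable[[str], float],
              n: int = 3) -> str:
    """Run n independent attempts concurrently and return the best-scoring one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(attempt, range(n)))
    return max(results, key=score)


# Toy stand-ins: attempt i produces a string of length i + 1,
# and longer is "better".
def toy_attempt(index: int) -> str:
    return "x" * (index + 1)


winner = best_of_n(toy_attempt, len, n=3)
print(winner)
```

Note the cost side: n attempts means roughly n times the tokens for one unit of work, which is exactly how this approach drives usage up.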
And if you are doing QA manually, you're missing out on having e.g. Codex's "Computer Use" or "Browser Use" automate your manual verification steps and collect a report for you to review more quickly. Codex can control multiple virtual cursors simultaneously in the background without stealing focus, to parallelize this.
If you want to use up more tokens to get more done (though more outside of your control and ability to review of course), that's how.
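For anyone who wants to see the "several agents on the same item, then pick one" idea in code, here is a minimal sketch. `run_agent` is a hypothetical stand-in that returns a fake, deterministic result so the example runs offline; a real version would call whatever agent CLI or API you use and score the result with your own test suite.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Attempt:
    agent_id: int
    patch: str
    tests_passed: int


def run_agent(agent_id: int, task: str) -> Attempt:
    """Hypothetical stand-in for one agent session working on `task`."""
    # Fake, deterministic score so the sketch runs without any real agent.
    score = (agent_id * 7) % 5
    return Attempt(agent_id, f"patch-{agent_id} for {task}", tests_passed=score)


def best_of_n(task: str, n: int) -> Attempt:
    """Run n independent attempts in parallel and keep the best-scoring one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        attempts = list(pool.map(lambda i: run_agent(i, task), range(n)))
    # max() keeps the first maximum, so ties go to the lowest agent_id.
    return max(attempts, key=lambda a: a.tests_passed)


winner = best_of_n("fix flaky login test", n=4)
print(winner.agent_id, winner.tests_passed)
```

Note that n attempts cost roughly n times the tokens of one attempt, which is exactly how this workflow runs up the bill.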
I'm working on some serious data analysis + realtime async code, and I use 200-400 million tokens a day with Claude Code alone (via ccusage). The complexity of the code seems to have a big impact on the number of tokens used. On simpler projects I use many fewer tokens.
My programming endurance is much greater now (2-3x focused hours per day), my productivity per hour is multiples higher, and I code seven days a week now because it's really exciting.
All told, I would pay for these tools as much as I would pay for full-time human programmer(s).
It really depends on the way you use AI. If you just prompt it for a task and either accept or reject the output, you won't spend much.
But if you are like me, you aggressively document and brainstorm before planning, review that documentation with subagents, make modifications, aggressively plan, verify that plan with subagents, make modifications, split the work into a large number of phases, plan again for each phase, write tests to cover 100%, implement each phase, do intermediate and final code reviews with subagents, apply fixes, and write final documentation. If you do all of this in parallel, with multiple tabs in your terminal each running Claude Code for 10-12 hours a day, then $5000 per day is not much.
If you use an Anthropic or OpenAI subscription and you spend $1000 per month, you are not using AI much.
I spent $24,096.47 in "API" costs with my $200 Claude Code Max subscription in April.
I'm building my own SaaS. I spent 6 months writing the code by hand before using Claude, and that was fine, but it's much faster to give the exact specs to Claude and have 3-4 sessions working in parallel with me. When you validate changes with exact test specs there's much less correction you need to do. I always hit my weekly limit and it's far cheaper for me to use this than to hire someone and spend time onboarding them.
> I'd much rather hire a junior engineer who spends $100-$200/month
I'd much rather hire a junior engineer at $1.20/hour too! Can you hook me up with your contract services provider?
Obviously I know you're talking about AI costs only. But the idea of doing that analysis without looking at the salary of the person running the tool seems to be completely missing the point.
Now, sure, there are legitimate arguments to be made about efficacy and efficiency and sustainability and best practices. But, no, $100k/year absolutely doesn't need to be "justified" if it works. That's cheaper than the alternative, and markedly so.
FWIW, you're nitpicking a strawman. I put "justified" in scare quotes for a reason, qualified it with "if it works" (which is, quite literally, the definition of a justification) and put it immediately after a sentence enumerating a list of legitimate questions for debate (all of which would be part of any justification analysis).
You agree with me, basically.
The core point is that these very large AI bills are not actually large in context, as the pre-existing scale of expenses for software engineering are larger still and this at least promises to reduce those markedly.
To wit: argue about whether AI works[1] for software development, don't try to claim it's too expensive, it's clearly not.
They keep forgetting to put "make no mistakes", "think deeply" and "get it right the first time" in their prompts.
When people have no ability to understand what they are doing, they will just rerun it endlessly hoping they get something passable. When that doesn't happen they burn money.
I doubt most of this is from rerunning the same prompts over and over. This token burn is more likely from people using swarms of agents and orchestrators for “efficiency”.
“I’ve got 2 dozen agents churning through the backlog to build this feature that would take one agent an hour to implement.”
In your fictional world you hire a junior who will write code manually, right?
First, I interview people. Junior skills in manual coding dropped sharply this year. These are people who started school coding manually and switched mid-course. In two years there will be no such people.
Well, that will never happen again in this world unless we go back to caves, especially for juniors. A junior who writes good code is already a dying unicorn.
The outcome will be ... you will hire a junior ... who will burn more tokens, and the chances of mistakes with a less expensive model and fewer tokens are even higher.
I mean even the normal people we get in interviews have no clue, like 80% are just ignorant.
I stopped an interview after 5 minutes: when I asked what ls -ahl does, he started telling me how he vibe/AI codes stuff and that's his workflow. Okay, if you don't know the basics, guess what? Everyone can replace you, or at least I'm not hiring you (I only told him that's not what we are looking for and thanked him).
I'm interviewing juniors. Their manual skills have dropped sharply, and that's for people who went to school in the manual age, which maybe stopped being manual only last year. Let's see what it will be like in a year or two lol
At this point do you even need to hire any juniors at all? It seems like there's a heavy reliance on AI agents and LLM especially for juniors. Is hiring a warm body that sits on a chair and prompts at a computer a good use of money?
Yes and no. Everybody is trying to hunt the junior unicorn. They exist, but the ratio is 1 out of 30. For these people, AI is a real elevator for their career.
> If it was actually productive, then the revenue would increase and affordability wouldn't be a question.
Revenue has increased. Have you seen Meta's latest earnings? +33% revenue - in this economy.
Affordability is not a question. There is a reason companies like Meta have no issue with their engineers spending $1k/day on tokens. It's just not that much compared to how much they make per employee.
Yes, my thoughts exactly. Productivity by definition creates things, hopefully valuable things. Is all the extra burn on chatbots worth the cost? Has Uber somehow gotten dramatically more efficient and effective due to this massive budget overrun? Or have they just given people shiny and expensive ways to push the same work around?
I'd argue it's often the contrary -- since it's easy to ship features and fixes, people often ship things without questioning if it makes business sense to support a use case, or if the design is solid. Now you have exactly the same revenue but more things to maintain.
This is my thought too. The eggheads in accounting set budgets, and we produce products within that budget. I could be twice as productive with twice as many people, and maybe 50% more productive with good AI, but if it's not budgeted for it's an issue (especially short-term before the product is released).
That is not true at all. No matter how "productive" a company is means nothing if people aren't buying your product. And using LLMs to be more productive will not convince anyone to buy your product. Human creativity and intuition to make a product that people want to use is what sells. Productivity for productivity's sake doesn't really move the needle at all, and can make things worse.
It's actually incredible the extent to which non devs imposing KPIs on devs underestimate how badly this will get gamed, whether it's AIs, PR/line counting or whatever.
Gaming is one thing, fundamentally not understanding how engineering works will lead to shittier outcomes and cost the company in ways the management will never understand.
Management in the age of AI is falling for the doorman fallacy wrt engineering. If lines of code were the most valuable aspect of software engineering, my front end JavaScript intern would’ve been the most valuable person in the company. https://www.jaakkoj.com/concepts/doorman-fallacy
I actually do this, but that's mostly because our team reviewed all the existing autoformatters for the relatively obscure language we use, and either really hated the formatting or found that they actually introduced errors!
I don't understand this critique.
(1) Did you previously think you weren't getting paid for doing what a company wants you to do, aka what THEY thought was productive?
(2) Do you think all this AI generated code is useless?
I think the point was that, when you make a metric goal of "you must use AI this much", then people will use AI even in ways that aren't adding to productivity.
To answer your second question: Yes, much of it is worse than useless. The tools need guidance to produce useful output. If you use it poorly, you will get garbage output that may do more harm than good.
And your response does not address the point being made in the comment you replied to: Many people are being evaluated by how many tokens they burn, which is about as good a metric as lines of code written.
I think parent is saying "% of code being generated by AI" is not a generally good, direct metric for business value. It's akin to the "we are pushing SO MUCH CODE" phase of early ai marketing.
If we're trying to measure the value of adopting a tool, it's probably better to measure the ROI of that tool rather than the usage % of that tool, especially when usage is basically mandated.
To directly answer your questions:
1. You're being paid to create value for the business, which "doing what they think is productive" is a proxy for. You're not being paid to use a tool a high % of the time.
2. It doesn't seem like parent even commented on the quality of the code generated. I think anyone that uses it regularly can agree that: a) the code is not useless, b) not all generated code is immediately production ready, and c) AI generation of code is an accelerant for software development.
1) I think if the company I work for spends too much effort on things that aren't going to make money, they won't be able to pay me anymore, no matter what they "think" is productive. That's not how executives at companies like this make decisions, though.
Goodhart's Law isn't a problem immediately. If you want more code to be written, and the only feasible way to write it to goals is to heavily use AI, then you might run into the problems of AI-generated code, and an infrastructure that's poorly architected and much less understood than it would've been ten years ago.
> (1) ...getting paid for doing what a company wants you to do...?
At my previous company, when the thing they thought they wanted me to do (which was not the thing they actually wanted... but whatever) diverged from my values I quit. You can just do things.
> (2) Do you think all this AI generated code is useless?
Almost universally, yes. Especially in organizations that historically haven't been particularly careful about hiring and have a huge number of young, inexperienced people. There are exceptions but they're rare enough that throwing that particular baby out with the bathwater isn't a big loss.
1. At my level, the company is not just paying me to do a task the way they want it done, they are paying for my experience to orchestrate the best way to do it. They want an outcome, and I'm responsible for figuring out how to get to that outcome with the right balance of cost, correctness, etc. But yes, the most dystopian reality is what you said.
2. It's not useless, but the AI generated code is absolutely lower quality than what I would have written myself, and there is no desire to clean it up. Companies have always had a disastrously bad understanding of technical debt and they finally have a tool they can shove down developers' throats that trades even more velocity for even less quality. They're going to take that trade every single time.
you're missing their point; LLM use is often a part of your evaluation at some of these larger companies and they expect you to use them heavily or you will get a lashing
GP is just saying that any metric will be gamed, and if there are costs associated with it, they will grow. Let’s say you set a metric that says the most productive devs are the ones with the most files changed; you can soon expect every function and structure to be its own file. Same if you say that sales commissions are based on how much time you spend calling: expect the phone bills to grow a lot.
> figuring out if the company can afford this level of productivity at scale
This is the thing that boggles my mind. They spent their budget. They have 4 months of data. What do they have to show for it?
I'm not a hater; I'm not a luddite. I have a $200 Max plan and I use it.
But are you saying that Uber made this tool available, urged everybody to use it, and is confused about what happens when it worked? It's one thing if they decide AI isn't productive enough to be worth the cost.
Are they out of ideas on what to build next, or something?
The personal max and teams plan actually are an amazing bargain compared to the API PAYG cost you get with Enterprise. I guess they really need their Enterprise features though, otherwise they could just tell users to expense a $200 max sub. Enterprises gonna Enterprise.
> I'm not a hater; I'm not a luddite. I have a $200 Max plan and I use it.
I'm glad to see we've reached the point of AI discourse at which anything that might be construed as criticism must be prefixed by "I'm also part of the cult, I'm not a non-believer, but" to avoid being dismissed as a heretic.
While this is a fundamentally stupid story to begin with, it was at least reported somewhat better in other venues. The original report came from The Information, and at least this Yahoo Finance[0] writeup mentioned that. This article has very little content and no sourcing.
According to [1], there are about 5500 people in Engineering at Uber. Using $1250 as the mid-point of the $ spend range, that comes to about $6.9 Million in engineering AI spend, ballpark, with the range being $2.75 Million - $11 Million. The article lists $3.4 Billion as the R&D spend.
The AI spend does not appear to be a significant chunk of R&D spending (0.3% in 4 months or 1% annualized). If they didn't plan for it, sure, it's not peanuts in the budget, but in context not that much.
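The arithmetic here is easy to get tangled, so a rough sketch of it follows. It assumes the $500-$2,000 range is per engineer per month and that the $3.4 Billion R&D figure is annual; neither is stated outright above, so treat both as my assumptions. Under that reading, the 0.3% / 1% figures line up with the low end of the range, while the $1250 midpoint comes out closer to 0.8% of R&D over four months.

```python
ENGINEERS = 5_500        # headcount figure cited above
R_AND_D_SPEND = 3.4e9    # R&D figure cited above; assumed here to be annual
MONTHS_OBSERVED = 4


def spend_share(per_engineer_monthly: float, months: int) -> tuple[float, float]:
    """Return (total spend in dollars, share of R&D spend) over `months`."""
    total = ENGINEERS * per_engineer_monthly * months
    return total, total / R_AND_D_SPEND


for label, monthly in [("low", 500), ("mid", 1_250), ("high", 2_000)]:
    one_month, _ = spend_share(monthly, 1)
    _, share_4 = spend_share(monthly, MONTHS_OBSERVED)
    _, share_12 = spend_share(monthly, 12)
    print(f"{label}: ${one_month / 1e6:.2f}M/month, "
          f"{share_4:.2%} of R&D in 4 months, {share_12:.2%} annualized")
```

Either way the conclusion holds: the spend is a low single-digit percentage of R&D at most.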
The real question is, what did they get for that amount? The article claims that 70% of committed code is now AI-generated, so presumably the code passed review and tests. Did it accelerate the feature count? Did it reduce quality problems? Did it lead to other benefits?
Sadly the article is silent on the outcomes, besides the higher spend.
Maybe 4 months is too soon to assess the benefits. On the other hand, in an agile world ...
The actual source https://www.theinformation.com/newsletters/applied-ai/uber-c... says "about 11% of real, live updates to the code in its backend systems are being written by AI agents built primarily with Claude Code, up from just a fraction of a percent three months ago" and "He wouldn’t disclose exact figures of the company’s software budget or what it spends on AI coding tools."
I think as it becomes more common for executives to think we can replace software engineering with agents, I wonder if they might be basing their decisions off of unrealistic perceptions of the average software engineer. I guess I'm mulling two somewhat contradictory senses:
1. You get out of it what you put into it. A savvy CTO might be incredibly excited by everything they can do with agents, and improperly think that all the software engineers can do the same thing, when in reality your org's average software engineers might not have the creativity to even think of many cases where it could save them work. So by mandating agent usage, you might find that productivity hasn't improved while AI costs have increased.
2. When using AI, there are two gaps that become more obvious. First is the gap of: who tells the agent what to do? In many orgs, product isn't technically savvy enough to come up with a detailed spec/plan that an LLM can use. And many cog-in-machine developers aren't positioned to come up with the spec, they just want to implement it. By expecting work to be implemented by agent-using developers, you might instead find a lot of idle workers waiting for work to show up. Second is the QA/review cycle. You've introduced a big change to the org but are you really saving cost or shifting it?
I'm all for introducing LLMs as an optional tool to help existing developers increase velocity and quality, but I think the "let's restructure the org" movement is really dicey, especially for mid-size or smaller employers.
Beyond that, it's a force multiplier and it doesn't care if the force is positive or negative. Someone with poor software engineering principles can use AI to make an absolute mess quickly.
It's very easy to blow through hundreds of dollars a session using API tokens, especially with the 1M context, if you aren't careful about clearing old context.
At the same time the subscription will allow the same usage for hundreds of dollars a month.
Either Anthropic is absolutely hosing API users, massively subsidizing subscriptions, or a little bit of both.
"Cursor estimated last year that a $200-per-month Claude Code subscription could use up to $2,000 in compute, suggesting significant subsidization by Anthropic. Today, that subsidization appears to be even more aggressive, with that $200 plan able to consume about $5,000 in compute"
Really curious how many people actually get close to that level of usage? Their general business plan only offers the $100 version, with pay-as-you-go above that.
If 95% of people are using $100 of value a month, the whales may not be hurting them that badly.
Anthropic has a very "interesting" business model where you get subscription pricing as long as you are under 150 employees. When you hit 151, you have to start paying API prices overnight for everyone, and your total bill instantly multiplies.
They are getting you hooked on cheaper tokens, then raking you in when you get scale. I'm sure Uber gets a break on list price, but I doubt they are anywhere near <150 employee subscription pricing.
Yeah, it's basically the opposite of how "product-led growth" SaaS works. Generally pay-as-you-go pricing is expensive at scale, but attractive initially. So you start on a pay-as-you-go plan, but as you scale you end up transitioning off pay-as-you-go to a negotiated commit. I.e. you call sales and sign a contract. Anthropic basically flips that around backwards.
I evaluated the pricing and could not justify the jump to Enterprise from Team. You lose the monthly subscription entirely when you jump to enterprise so you lose your ability to control costs.
You can cap per user, but without the rolling cap, are you really just going to tell a member of your team “No AI for the rest of the month”?
Have we reached a point yet where companies are spending millions a year on software licenses, cloud and AI to the point where the return isn't worth it?
Years ago I did work for a company that was spending over a million on Oracle product licenses and I was part of the consultant team they hired to rip it all out and just go for simple maintainable code based on open source products. Not only did it transform into a codebase that the average newly hired developer could maintain, you also had the savings of not paying Oracle a significant portion of your revenue.
I feel like that will repeat itself in a few years time with the current cloud and AI train everyone is on.
I haven't been in a professional setting for a while, I just code for fun nowadays so perhaps I'm somewhat out of the loop.
I love how these articles drop, and all of a sudden HN is filled with people who think engineering productivity is simple to measure.
Yes, productivity implies revenue (or cost reduction), and revenue is measurable.
However:
1. You spend money today to build features that drive revenue in the future, so when expenses go up rapidly today, you don’t yet have the revenue to measure.
2. It’s inherently a counterfactual consideration: you have these features completed today, using AI. You’re profitable/unprofitable. So AI is productive/unproductive, right? No. You have to estimate what you would’ve gotten done without AI, and how much revenue you would’ve had then.
3. Business is often a Red Queen’s race. If you don’t make improvements, it’s often the case that you’ll lose revenue, as competitors take advantage.
4. Most likely, AI use is a mixture of working on things that matter and people throwing shit against the wall “because it’s easy now.” Actually measuring the potential productivity improvements means figuring out how to keep the first category and avoid the second.
This isn’t me arguing for or against AI. It’s just me telling you not to be lazy and say “if it were productive you’d be able to measure it.”
> HN is filled with people who think engineering productivity is simple to measure.
I think the prevailing (correct) consensus is that developer productivity is actually very hard to measure, and every time it is attempted the measure is immediately made a target making the whole thing pointless even if it had been a solid measurement- which it wasn't.
IDK where you're getting the idea here that measuring productivity of anyone who isn't a factory worker is easy.
I mean, the option is not zero productivity or some productivity: it could be negative.
We doubt the productivity because we have enough experience with Claude Code to know that flooding your organization with that many tokens isn't just unproductive, it's actively harmful.
Minor shifts in productivity are hard to measure. Major jumps in productivity would be obvious. I think it’s clear that, if AI is affecting productivity, it’s to a minor degree at best.
If it were 10x productive you'd be able to measure it indirectly, you'd be unable to avoid measuring it. So the initial claims were clearly lies. The research question is:
Is it >1.0x productive?
I agree that's very hard to measure. But given what this shit costs, it had better be answerable, and the multiple had better justify the cost.
> Monthly API costs per engineer ranged from $500 to $2,000 as adoption skyrocketed across the company.
That's...not exactly a lot per engineer. It sounds like they just didn't budget correctly. Especially if the net of that work is more features that would have otherwise required hiring more engineers, which would cost a lot more than $500 to $2000 a month.
No, it's really not a lot at all, especially if you've got a mandate to maximize your AI usage, which many engineering orgs have right now. I burned $216 USD using Claude Code in March just doing some casual development on the side and certainly not as a part of any professional workplace mandate.
Most people don't have the team and time to do heavy token efficiency engineering. But that's all we do. marketplace.neurometric.ai has a bunch of task specific small models, and we charge flat monthly fees. We bear the token risk.
If they burned through their ML budget in four months while using heavily subsidized models, we're going to see companies burn through their ML budgets in less than a week once those subsidies are no longer in place and they have to pay per token used.
I think the tech industry in general is taking advantage of the fact that software productivity is hard to quantify to say whatever they want about their AI productivity gains. Apparently we are past the point of having to justify anything and can just equate increased AI spend with success.
AI coding tools probably need the same boring governance as cloud spend: budgets, alerts, team-level visibility, and a way to spot runaway usage before finance notices.
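A minimal sketch of what that kind of governance check could look like, in the spirit of cloud budget alerts. The record shape, team names, dollar amounts, and the 80% warning threshold are all hypothetical; in practice the records would come from whatever usage export your provider or gateway gives you.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    team: str
    user: str
    cost_usd: float


def team_totals(records: list[UsageRecord]) -> dict[str, float]:
    """Sum spend per team for team-level visibility."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.team] += record.cost_usd
    return dict(totals)


def budget_alerts(records: list[UsageRecord], budgets: dict[str, float],
                  warn_at: float = 0.8) -> list[str]:
    """Flag teams that are over budget, near it, or have no budget at all."""
    alerts = []
    for team, spent in sorted(team_totals(records).items()):
        budget = budgets.get(team)
        if budget is None:
            alerts.append(f"{team}: ${spent:,.0f} spent with no budget set")
        elif spent >= budget:
            alerts.append(f"{team}: OVER budget (${spent:,.0f} of ${budget:,.0f})")
        elif spent >= warn_at * budget:
            alerts.append(f"{team}: {spent / budget:.0%} of budget used")
    return alerts


records = [
    UsageRecord("payments", "alice", 900.0),
    UsageRecord("payments", "bob", 300.0),
    UsageRecord("search", "carol", 450.0),
    UsageRecord("growth", "dan", 120.0),
]
for line in budget_alerts(records, {"payments": 1000.0, "search": 500.0}):
    print(line)
```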
This is pointless without knowing what they are measuring. You could genuinely be moving faster, or you could be optimizing for engineers in a rat race to push more code because all their peers are now doing it, because those are the metrics you are measuring for "AI productivity".
We run an agentic pipeline in a different domain (data sourcing) and the only way the math works is to be ruthless about which stages actually need which model.
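To make the "which stages need which model" point concrete, here is a minimal sketch. The stage names, token counts, tier names, and prices are all made up for illustration; none of them come from the comment above or from any real price list.

```python
from dataclasses import dataclass

# Hypothetical tiers and prices in USD per million tokens (placeholders only).
PRICE_PER_MTOK = {"small": 0.5, "large": 15.0}


@dataclass(frozen=True)
class Stage:
    name: str
    tokens: int            # expected tokens per pipeline run
    needs_reasoning: bool  # does this stage actually need the big model?


def routed_tier(stage: Stage) -> str:
    """Send only reasoning-heavy stages to the expensive tier."""
    return "large" if stage.needs_reasoning else "small"


def pipeline_cost(stages: list[Stage], tier_for) -> float:
    """Total cost of one run, given a function that picks a tier per stage."""
    return sum(s.tokens / 1e6 * PRICE_PER_MTOK[tier_for(s)] for s in stages)


stages = [
    Stage("fetch and dedupe", 2_000_000, needs_reasoning=False),
    Stage("extract fields", 1_000_000, needs_reasoning=False),
    Stage("resolve conflicts", 200_000, needs_reasoning=True),
]
everything_large = pipeline_cost(stages, lambda s: "large")
routed = pipeline_cost(stages, routed_tier)
print(f"all large: ${everything_large:.2f}, routed: ${routed:.2f}")
```

With these invented numbers the routed pipeline costs about a tenth of the all-large one; the real ratio depends entirely on your own stage mix and prices.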
As a founder, the question I always have is "what is the marginal value per token relative to engineer-hours saved." More of a gut feel at the moment, but would be great to calculate.
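One way to move that from gut feel to a number, as a rough sketch: value the hours saved at a loaded hourly cost and divide by token spend. The hours-saved estimate is the hard part and has to come from you; every input below is a hypothetical placeholder.

```python
def token_roi(hours_saved: float, hourly_cost: float, token_spend: float) -> float:
    """Average dollars of engineer time saved per dollar spent on tokens."""
    if token_spend <= 0:
        raise ValueError("token_spend must be positive")
    return hours_saved * hourly_cost / token_spend


def marginal_roi(prev: tuple[float, float], curr: tuple[float, float],
                 hourly_cost: float) -> float:
    """Extra value per extra dollar between two periods.

    Each period is (hours_saved, token_spend). This is the marginal figure:
    what the additional spend bought, not the average over all spend.
    """
    (h0, s0), (h1, s1) = prev, curr
    if s1 == s0:
        raise ValueError("token spend did not change between periods")
    return (h1 - h0) * hourly_cost / (s1 - s0)


# Hypothetical: $100/hour loaded cost, spend grows from $400 to $1000.
print(token_roi(12, 100, 400))                   # average return at $400
print(marginal_roi((12, 400), (15, 1000), 100))  # return on the extra $600
```

In this invented example the average return looks healthy while the marginal return on the extra spend is below 1, which is the situation the question is really trying to detect.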
It's obvious that the word productivity has been used in this discussion to mean something other than the plain meaning of the word. If AI was productive, there would be no question about whether it could be afforded. If you're asking whether you can afford it then it isn't productive by definition.
They are using it to mean a mechanism that produces prodigious amounts of toxic waste. That does not conform to the historical understanding of the word.
This continues to boggle my mind so hopefully somebody can explain how this is happening.
I’ve been using all these tools since they started popping out around 2021 personally and professionally. I probably built four or five products at this point with assistance, not to mention the thousands and thousands of back-and-forth conversations for research or search or rubber ducking or whatever.
I have never spent more than whatever the top professional plan costs, which has consistently been $20 a month.
I asked a friend of mine who spent a couple hundred dollars in like a few hours how they did it. The answer was that they were basically getting these groups of agents stuck in a loop, constantly generating verbose bullshit that is not even interrogated and doesn’t come out with any artifact that is inspectable no matter how expert you are.
The couple of stories I have heard of these massive crazy spends are people literally just assuming these things can complete an entire human task in one shot, so they continue to hit the “spin the wheel” button until they get something closer to what they want.
But I’ve yet to see that actually work, and it flies in the face of every instruction guide or documentation or prompt engineering process that has been described over almost the last 5 years.
My usage is similar to yours, but if I were fairly experienced with the code base, I'd do a lot more. I haven't asked, but I suspect there are people in my team who go over $1K/month.
As always, the bottleneck is proper testing and reviews.
Edit: I'll also add that for not-so-important code used within the company, I suspect most people are going full-AI with it. For my personal (non-work) code, I just let the AI code it all - the risk is usually very low (and problems are caught quickly). If someone is using the "superpowers" skill, then even for basic features you can burn lots of tokens. I usually start with 20-40K tokens and end up with 80-90K tokens when it's finished. Which means that many of the requests prior to completion were sending in close to 80K tokens. Multiply that with the number of queries, etc.
Wasteful, but if someone else is paying ...
I see this repeated by others, including coworkers. It completely ignores caching. Caching itself is complicated, but the "longer context window = more expensive" is not 100% true and you are hampering yourself if you're not taking full advantage of large context windows.
One example - I was giving several agents different sub-problems to solve in a complex ML / forecasting problem. Each agent would write + run + read a Jupyter notebook. This worked ok, the notebooks would be verbose but it was fine... until one of them wrote out hundreds of thousands of rows to a cell output, creating a 500MB ipynb file. Claude tried several times to read it and it used my entire context limit.
The solution was to prescribe a better structure for doing the work (via CLI analysis scripts + folders to save research results to). But this required some planning, thought, and design work by me the operator.
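A different, narrower guard against the same failure (not the restructuring described above) is to strip oversized cell outputs before an agent ever reads the notebook. This sketch works on the notebook's JSON directly using the standard v4 fields (`cells`, `cell_type`, `outputs`); the 2,000-character threshold and the stub text are arbitrary choices of mine.

```python
import json


def strip_large_outputs(nb: dict, max_chars: int = 2_000) -> dict:
    """Return a copy of a notebook dict with oversized outputs replaced by a stub."""
    slim = json.loads(json.dumps(nb))  # deep copy; notebooks are plain JSON
    for cell in slim.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        size = len(json.dumps(cell.get("outputs", [])))
        if size > max_chars:
            cell["outputs"] = [{
                "output_type": "stream",
                "name": "stdout",
                "text": [f"[output of {size} chars removed]\n"],
            }]
    return slim


# Tiny in-memory notebook with one runaway output, standing in for a real file.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Forecast run"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print(df.to_string())"],
            "outputs": [{
                "output_type": "stream",
                "name": "stdout",
                "text": [f"row {i}\n" for i in range(100_000)],
            }],
        },
    ],
}
slim = strip_large_outputs(notebook)
print(len(json.dumps(notebook)), "->", len(json.dumps(slim)))
```

For a real file you would `json.load` it, strip it, and write the slim copy next to the original; the code and markdown cells survive untouched.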
When I see people spending $10k a month in tokens, I can only assume they are taking lazy, hands-off approaches to solving problems with the expensive hammer that is Claude Code. Ex: have Claude read all your emails every day... the lazy solution is to simply do that, but a smarter solution is to first filter the email body HTML to remove the noise.
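A minimal sketch of that email pre-filter using only the Python standard library. It drops `<script>`, `<style>`, and `<head>` content and keeps the visible text; the sample HTML is invented, and a real mailbox would need more care (quoted replies, signatures, tracking links).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script, style, and head content."""

    SKIP = {"script", "style", "head"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            # Collapse runs of whitespace inside each text chunk.
            self.chunks.append(" ".join(data.split()))


def email_html_to_text(html: str) -> str:
    """Reduce an HTML email body to its visible text, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><style>p{color:red}</style><title>Hi</title></head>"
    "<body><p>Invoice   #42 is due Friday.</p>"
    '<img src="https://example.com/pixel.gif">'
    "<script>track()</script><p>Reply to confirm.</p></body></html>"
)
text = email_html_to_text(sample)
print(text)
print(len(sample), "->", len(text), "characters")
```

Inline tags such as `<b>` split a sentence across lines in this version, which is usually harmless for an LLM reader but worth knowing about.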
But that is exactly what it is sold to people to do as a panacea: consume all the data, produce insights.
Nobody is being instructed to be judicious. Everyone is being instructed to use it as much as possible for all problem areas.
If it’s very large, especially if the tool needs to refer to documentation for a lot of custom frameworks and APIs, you often end up needing very large context windows that burn through tokens faster.
If it’s smaller or sticks with common frameworks that the model was trained on, it’s able to do a lot more with smaller context windows and token usage is way lower.
I don't use LLMs to write code (other than simple refactors and throwaway stuff) but I do use them heavily to crawl through big codebases and identify which files and functions I need to understand.
Some of the codebases I explore will burn through tokens at a rapid rate because there is so much complex code to get through. If I use the $20 Claude plan and Opus I can sometimes go through my entire 5-hour allocation in a single prompt exploring the codebase, and it's justified.
Other times I'm working on simple topics, even in a large codebase, and it will sip tokens because it only needs to walk a couple files to get to what it needs to answer my questions.
The LLM hype train has me reflecting on what a spoiled existence working in a ‘proper’ language provides though…
React devs, JS devs, front-end devs working on large sites and frameworks might be triggering tens of files to be brought into context. What an OCaml dev can bring in through a 5 line union type can look very different in less token-efficient and terse languages.
The monolithic codebases are easier to crawl for any problem that can't be conveniently isolated to a single microservice.
The same would be true in a monolith: The context to understand what's happening would be contained to a few files.
When the work starts crossing through domains and potentially requiring insight into how other pieces work, fail, scale, etc. then the microservice model blows up complexity faster than anything, even if you have the API documented.
My current job basically involves trying to improve processes that themselves make heavy use of LLMs. Once you have multiple agents in parallel running multiple experiments on improving the performance of primarily LLM driven tools it's not that hard to get your token usage pretty high.
I don't get it.
That is exactly what they are doing, yes
Also one engineer is treating the code as assembly. I've asked some pointed questions about code in his PR and the response was "yeah, I don't know, that's what the agent did".
Edit:
To everyone freaking out about the second guy. Yeah, I think being unable to answer questions about the code you're PRing is ill-advised. But requirement gathering, codebase untangling, and acceptance testing are all nontrivial tasks that surround code gen. I'm a bit surprised that having random change sets slurped up into someone else's rubber-stamped PR isn't the thing that people are put off by.
But it's like a kid running a lemonade stand. Total DIY weekend project quality stuff that they are demanding go live. Hardcoded credentials, no concept of dev/qa/prod environments, no logging, no tests, no source control.
I'm not really sure teaching basic SWE practices / SDLC / system design to people whose day job is like.. accounting makes sense compared to just accelerating developer productivity.
Bringing code does not help, but a validated user story with flow diagrams, a UI suggestion, and a valid ticket could. That’s how to bridge the gap.
Were I that CTO I’d explain that code carries liability, SWEs can end up in jail for malfeasance, fines, penalties, and lawsuits are what awaits us for eff-ups. “Coders” get fired if their code doesn’t work. Same speech to the devs, do exactly as much unsolicited Accounting as you wanna get fired for. Talk fences, good neighbours.
Non-technical people are not writing tickets, they are just slinging slop.
Another anecdote of things I've seen - a non-technical person setting up some web scraping monstrosity with 200k lines of code. They beat their chest about how they didn't need the IT org. 1 month goes by and of course it breaks as soon as anything on the website changes and now they have a gun to IT's head to "fix it" and take it over.
This outcome for a DIY brittle web scraper is obvious to anyone that's ever written code, but shocking to someone who thinks LLMs are magic.
The only difference is that this is happening to us.
I just can't make the joke work. There really are people that think they can get paid to press the agent's on button. How long before their checks stop clearing and it "just works itself out naturally"?
I can do so much more with my spare time now. I throw agents at problems and get way more done.
$1k in tokens every day is easy to hit.
It’s not like AI is the first time this happened. CI/CD and extensive preflight and integration and canary testing is also a way of saving engineer time and improving throughput at the cost of latency and compute resources. This is just moving up the semantic stack.
Obviously as engineers we say “awesome more features and products!” but management says “awesome fewer engineers!” Either way, pasting the ticket in and letting a machine do the work for a fraction of the cost was the right choice. There’s no John Henry award.
If it were producing equivalent outcomes, sure. So far I haven't personally seen strong evidence for that. LLMs do write code pretty competently at this point, but actually solving the correct problem, and without introducing unintended consequences, is a different matter entirely.
If you're not doing the design of the solutions for problems as an engineer or at least making the decisions and owning the maintenance of that architecture/design, what even is your job at that point?
Unfortunately the people who offload the work of understanding and interacting with tickets just end up offloading the consequences to everyone else who has to do extra work to make sure their LLM understands the task, review the work to make sure they built the right thing, and on and on.
The same thing happens when people start sending AI bots to attend meetings: The person freed up their own time, but now everyone else has to work hard to make sure their AI bot gets the right message to them and follow up to make sure what was supposed to happen in the meeting gets to them.
"Their ticket" = that was AI generated. After which they will wait for their AI-generated PR to be checked by an automated AI QA that will validate against the AI-generated spec.
It feels like an important metric of "corporate AI adoption" should be how effective the human is at steering the AI.
IF THE HUMAN ISN'T EFFECTIVE, THE HUMAN NEEDS TO GO.
If it manages to produce a working solution, then it's great! Why would you waste your time on it?
If it fails, then it's great! You find your value by solving the ticket, which can be a great example of where a human can still prevail over the AI (joke: AI companies might be interested in buying such examples).
(All assuming that your time cost is pricier than token spending. Totally different story if your wage is less than token cost)
There’s your problem. You’re trying to be responsible instead of trying to burn tokens so you can have your name on top of some leaderboard for most wasteful AI users.
Same but in regards to quotas. I'm on the 200 EUR ChatGPT plan, so presumably have the highest quota, using the "most expensive" models, on highest reasoning, in fast-mode (1.5x quota usage) and after a full day of almost exclusively doing programming with agents, I still get nowhere close to hitting my quota.
In fact, since I started using agents for coding, the only time I even got close, was when I was doing cross-platform development with the same as above, but on three computers at the same time, then I almost hit my weekly quota. But normally, I get down to ~20% of the quota but almost never below that. I don't see how I could either, I'm already doing lots of prompts and queries "for fun" basically.
Edit: Just checked with ccusage and I've been doing about $450/day for the last week. A bit more than usual, but I still haven't come close to weekly limits and never hit the 5hr rate limit.
The API rates and monthly plan rates are not the same.
If you're using enough to justify the 200EUR plan (instead of the 100EUR plan), your use might actually be as high as some of the API bills discussed above.
Yeah, obviously, not sure why anyone would be using APIs at this point, seems bananas to spend more than 10 EUR per day when these "almost-endless" subscriptions exist.
> My completely unfounded conjecture is that OpenAI is trying to grab developers back from Claude by burning $$$$.
Unlikely, since codex TUI was launched OpenAI pretty much had every developer's pocket already as the agent is miles and leagues ahead of Claude Code, pretty much from inception. No other provider comes close to ChatGPT's Pro Mode either, I don't even think it's a quota/pricing thing, have the best models and people will flock by themselves.
Can codex run background tasks yet? CC's ability to run a process in the background and monitor its output for errors while another process accesses that first process, is probably what got CC so popular for web development over codex to start with.
I have both of those, yet seemingly I guess I'm not setting my goal in such a way that it supports "endless inference" like that. My goals eventually have ends, and that's when I move on. Optimization sure sounds like something you can throw away a good amount of tokens/quotas on, so yeah.
I don't know about $10,000, but I can see hitting $1,000 pretty easily if you aren't looking at the costs.
It will try and try and try, though.
So yeah, probably the same thing people do anyway, just that instead of compile time it's now generating time.
I always have a few agents (2-5) doing research and working on plans in parallel. A plan is a thorough and unambiguous document describing the process to implement some feature. It contains goals, non-goals, data models, access patterns, explicit semantics, migrations, phasing, requirements, acceptance criteria, phased and final. Plans often require speculative work to formulate. Plans take hours to days to a couple of weeks to write. Humans may review the plans or derived RFCs. Chiefly AI reviews the code (multiple agents with differing prompts until a fixed point is reached between them). Tests and formal methods are meant to do heavy lifting.
In my highest volume weeks, I ship low hundreds of thousands of lines of software not counting changes to deps.
> At a corporate level, I'd much rather hire a junior engineer
Any formulation of a problem sufficient for a truly junior engineer to execute is better given to an agent. The solution is cheaper, faster, and likely better. If the latter doesn't hold, 10 independent solutions are still cheaper and faster than a junior engineer.
There is no longer any likely path to teaching a junior engineer the trade.
I usually succeed, BTW. I spend a lot of time planning, but usually each PR is a few hundred lines, and fairly easily reviewable.
I mostly work with Python backends, though these days it might be any language (Ruby, Go, TS).
But 10x faster also gets you to market sooner. Which has value.
- Agents that spawn other agents
- Telling agents to go look at the entire codebase or at a lot of documents constantly
- MCP/API use with a lot of noise
- Loops where the agent is running unattended.
I do think it's not really responsible use and a loop where the agent is trying to fix CI for one hour for something that would take you five minutes (for example) is absurd. But people do that.
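For what it's worth, the unattended-loop case is the easiest one to put a ceiling on. A minimal sketch, assuming a hypothetical `run_attempt` callable that reports whether the attempt succeeded and what it cost; the function names and dollar figures here are made up for illustration and are not from any real agent framework:

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LoopResult:
    succeeded: bool
    attempts: int
    spent_usd: float


def run_with_budget(
    run_attempt: Callable[[], Tuple[bool, float]],
    max_usd: float,
    max_attempts: int,
) -> LoopResult:
    """Retry a (hypothetical) agent attempt until it succeeds or a cap is hit."""
    spent = 0.0
    attempts = 0
    while attempts < max_attempts and spent < max_usd:
        ok, cost = run_attempt()  # one attempt: (succeeded?, cost in USD)
        attempts += 1
        spent += cost
        if ok:
            return LoopResult(True, attempts, spent)
    # A cap was reached: stop and hand the problem back to a human.
    return LoopResult(False, attempts, spent)


# Illustrative only: an attempt that never fixes CI and costs $2.50 each time.
result = run_with_budget(lambda: (False, 2.5), max_usd=10.0, max_attempts=20)
print(result)
```

The spend check happens before each attempt, so the final attempt can overshoot `max_usd` by up to one attempt's cost. That seems like an acceptable trade for keeping the loop simple.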
These spend rates are in part due to operating on a larger code base. Operating on a larger code base means more time searching and understanding the code, tests, test output. They are also due to going all-in on agentic coding.
It can feel painfully slow to go back to coding by hand when for a dollar you can build the same functionality in a minute. Now do this with multiple sessions and you can see where the cost goes.
> I genuinely challenge someone spending $5-$10k a month to demonstrate how that turns into $50-$100k in value.
$10k a month on tokens is just not that much when you're already making $2M per engineer. If their productivity has increased even 10% then the spend was well worth it.
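Spelling that arithmetic out, as a rough sketch only: the $2M and $10k figures are the ones quoted above, and treating a 10% productivity gain as 10% of revenue per engineer is a simplifying assumption, not something anyone has measured.

```python
# Figures from the comment; the mapping from productivity to revenue is assumed.
revenue_per_engineer = 2_000_000   # USD per year
token_spend_per_month = 10_000     # USD
productivity_uplift = 0.10         # hypothetical fraction of revenue gained

annual_token_spend = token_spend_per_month * 12
added_value = revenue_per_engineer * productivity_uplift
# Uplift at which the added value exactly equals the token bill.
break_even_uplift = annual_token_spend / revenue_per_engineer

print(f"annual token spend: ${annual_token_spend:,}")
print(f"value of a 10% uplift: ${added_value:,.0f}")
print(f"break-even uplift: {break_even_uplift:.0%}")
```

Under those assumptions the spend pays for itself at roughly a 6% uplift. Whether any uplift can actually be attributed to the tools is the open question.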
Case in point, Meta made 33% more revenue this earnings report. Now you can nitpick and ask for attribution down to the dollar, but macro trends speak for themselves.
Whereas a good prompt will give solid leads to all the specifics needed to complete the task.
Agents can iterate on a problem for hours if they can see their results and be given a higher level goal to evaluate their progress toward.
When you have an agent working for minutes or hours, never wait on it. Use that time to spin up another agent.
You can also spin up several agents in parallel to attempt the same item of work and compare their results to choose which to work off for next steps, instead of rolling the dice on a single option at a time and gambling that it's better to refine that first attempt instead of retrying from the start several more times.
And if you are doing QA manually, you're missing out on having e.g. Codex's "Computer Use" or "Browser Use" automate your manual verification steps and collect a report for you to review more quickly. Codex can control multiple virtual cursors simultaneously in the background without stealing focus, to parallelize this.
If you want to use up more tokens to get more done (though more outside of your control and ability to review of course), that's how.
My programming endurance is much greater now (2-3x focused hours per day), my productivity per hour is multiples higher, and I code seven days a week now because it's really exciting.
All told, I would pay for these tools as much as I would pay for full-time human programmer(s).
this is your “problem” - you are missing the “nightly” part. on my box LLMs run 24/7 :)
But if you are like me, you aggressively document and brainstorm before planning, you review that documentation with subagents, make modifications, you aggressively plan, you verify that plan with subagents, make modifications, have a large number of phases, plan again for each phase, write tests to cover 100%, implement each phase, do intermediate and final code reviews with subagents, apply fixes, write final documentation and do all these in parallel, if you have multiple tabs in your terminal each running Claude Code for 10-12 hours a day, then $5000 per day is not much.
If you use an Anthropic or OpenAI subscription and you spend $1000 per month, you are not using AI much.
I'm building my own SaaS. I spent 6 months writing the code by hand before using Claude, and that was fine, but it's much faster to give the exact specs to Claude and have 3-4 sessions working in parallel with me. When you validate changes with exact test specs there's much less correction you need to do. I always hit my weekly limit and it's far cheaper for me to use this than to hire someone and spend time onboarding them.
I'd much rather hire a junior engineer at $1.20/hour too! Can you hook me up with your contract services provider?
Obviously I know you're talking about AI costs only. But the idea of doing that analysis without looking at the salary of the person running the tool seems to be completely missing the point.
Now, sure, there are legitimate arguments to be made about efficacy and efficiency and sustainability and best practices. But, no, $100k/year absolutely doesn't need to be "justified" if it works. That's cheaper than the alternative, and markedly so.
If you're trying to say that 100k is less than 200k, you're right.
I don't see how any of that won't need to be justified. You can spend a lot of money and not get enough of a return...
You agree with me, basically.
The core point is that these very large AI bills are not actually large in context, as the pre-existing scale of expenses for software engineering is larger still and this at least promises to reduce those markedly.
To wit: argue about whether AI works[1] for software development, don't try to claim it's too expensive, it's clearly not.
[1] "Is justified" in the vernacular.
When people have no ability to understand what they are doing, they will just rerun it endlessly hoping they get something passable. When that doesn't happen they burn money.
“I’ve got 2 dozen agents churning through the backlog to build this feature that would take one agent an hour to implement.”
First, I interview people. Junior skills in manual coding dropped sharply this year. These are people who started school coding manually and switched mid-course. In two years there will be no such people.
Well, that will never happen anymore in this world unless we go back to caves, especially for juniors. A junior who writes good code is already a dying unicorn.
The outcome will be ... you will hire a junior ... who will burn more tokens, and the chances of mistakes with a less expensive model and fewer tokens are even higher.
I mean even the normal people we get in interviews have no clue, like 80% are just ignorant.
I stopped an interview after 5 minutes: when I asked what ls -ahl does, he started telling me how he vibe/AI codes stuff and that's his workflow. Okay, if you don't know the basics, guess what? Everyone can replace you, or at least I'm not hiring you (I only told him that's not what we are looking for and thanked him).
we are doomed :D
The bubble is an echo chamber.
> which means figuring out if the company can afford this level of productivity at scale.
If it was actually productive, then the revenue would increase and affordability wouldn't be a question.
Revenue has increased. Have you seen Meta's latest earnings? +33% revenue - in this economy.
Affordability is not a question. There is a reason companies like Meta have no issue with their engineers spending $1k/day on tokens. It's just not that much compared to how much they make per employee.
Well, that’s to be expected when using AI tools becomes relevant in your performance evaluation.
Management in the age of AI is falling for the doorman fallacy wrt engineering. If lines of code were the most valuable aspect of software engineering, my front end JavaScript intern would’ve been the most valuable person in the company. https://www.jaakkoj.com/concepts/doorman-fallacy
1. you sample a few to see that they are actually meaningful,
2. they go to prod and are validated without having to roll back.
Still needs to be managed. But it should be much easier for a manager to catch an engineer gaming PRs than something like AI use or lines of code.
Edit: y'all are some whiney folk, ain't ya?
And your response does not address the point being made in the comment you replied to: Many people are being evaluated by how many tokens they burn, which is about as good a metric as lines of code written.
If we're trying to measure the value of adopting a tool, it's probably better to measure the ROI of that tool rather than the usage % of that tool, especially when usage is basically mandated.
To directly answer your questions:
1. You're being paid to create value for the business, which "doing what they think is productive" is a proxy for. You're not being paid to use a tool a high % of the time.
2. It doesn't seem like parent even commented on the quality of the code generated. I think anyone that uses it regularly can agree that: a) the code is not useless, b) not all generated code is immediately production ready, and c) AI generation of code is an accelerant for software development.
2) Mostly, yes.
At my previous company, when the thing they thought they wanted me to do (which was not the thing they actually wanted... but whatever) diverged from my values I quit. You can just do things.
> (2) Do you think all this AI generated code is useless?
Almost universally, yes. Especially in organizations that historically haven't been particularly careful about hiring and have a huge number of young, inexperienced people. There are exceptions but they're rare enough that throwing that particular baby out with the bathwater isn't a big loss.
1. At my level, the company is not just paying me to do a task the way they want it done, they are paying for my experience to orchestrate the best way to do it. They want an outcome, and I'm responsible for figuring out how to get to that outcome with the right balance of cost, correctness, etc. But yes, the most dystopian reality is what you said.
2. It's not useless, but the AI generated code is absolutely lower quality than what I would have written myself, and there is no desire to clean it up. Companies have always had a disastrously bad understanding of technical debt and they finally have a tool they can shove down developers' throats that trades even more velocity for even less quality. They're going to take that trade every single time.
This is the thing that boggles my mind. They spent their budget. They have 4 months of data. What do they have to show for it?
I'm not a hater; I'm not a luddite. I have a $200 Max plan and I use it.
But are you saying that Uber made this tool available, urged everybody to use it, and is confused about what happens when it worked? It's one thing if they decide AI isn't productive enough to be worth the cost.
Are they out of ideas on what to build next, or something?
I'm glad to see we've reached the point of AI discourse at which anything that might be construed as criticism must be prefixed by "I'm also part of the cult, I'm not a non-believer, but" to avoid being dismissed as a heretic.
[0]: https://finance.yahoo.com/sectors/technology/articles/ubers-...
The AI spend does not appear to be a significant chunk of R&D spending (0.3% in 4 months or 1% annualized). If they didn't plan for it, sure, it's not peanuts in the budget, but in context not that much.
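The annualized figure is just the 4-month share scaled to 12 months. A quick sketch of that step, using only the 0.3% from the comment and assuming spend continues at the same rate for the rest of the year:

```python
# 0.3% of R&D spending over 4 months, from the comment above.
share_of_rd_in_4_months = 0.003
months_observed = 4

# Assumes the same run rate for the rest of the year.
annualized_share = share_of_rd_in_4_months * 12 / months_observed

print(f"annualized share of R&D: {annualized_share:.1%}")
```

That comes to about 0.9%, which rounds to the 1% quoted.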
The real question is, what did they get for that amount? The article claims that 70% of committed code is now AI-generated, so presumably the code passed review and tests. Did it accelerate the feature count? Did it reduce quality problems? Did it lead to other benefits?
Sadly the article is silent on the outcomes, besides the higher spend.
Maybe 4 months is too soon to assess the benefits. On the other hand, in an agile world ...
[1] https://www.unifygtm.com/insights-headcount/uber
1. You get out of it what you put into it. A savvy CTO might be incredibly excited by everything they can do with agents, and improperly think that all the software engineers can do the same thing, when in reality your org's average software engineers might not have the creativity to even think of many cases where it could save them work. So by mandating agent usage, you might find that productivity hasn't improved while AI costs have increased.
2. When using AI, there are two gaps that become more obvious. First is the gap of: who tells the agent what to do? In many orgs, product isn't technically savvy enough to come up with a detailed spec/plan that an LLM can use. And many cog-in-machine developers aren't positioned to come up with the spec, they just want to implement it. By expecting work to be implemented by agent-using developers, you might instead find a lot of idle workers waiting for work to show up. Second is the QA/review cycle. You've introduced a big change to the org but are you really saving cost or shifting it?
I'm all for introducing LLMs as optional to help existing developers increase velocity and quality, but I think the "let's restructure the org" movement is really dicey, especially for mid-size or smaller employers.
Beyond that, it's a force multiplier and it doesn't care if the force is positive or negative. Someone with poor software engineering principles can use AI to make an absolute mess quickly.
This infers value from spend, which makes no sense. Burning the budget tells us engineers like the tool, not that it's producing value.
Show me how to make two dollars whilst spending one, and budget isn't a problem.
At the same time the subscription will allow the same usage for hundreds of dollars a month.
Either Anthropic is absolutely hosing API users, massively subsidizing subscriptions, or a little bit of both.
"Cursor estimated last year that a $200-per-month Claude Code subscription could use up to $2,000 in compute, suggesting significant subsidization by Anthropic. Today, that subsidization appears to be even more aggressive, with that $200 plan able to consume about $5,000 in compute"
If 95% of people are using $100 of value a month, the whales may not be hurting them that badly.
They are getting you hooked on cheaper tokens, then raking you in when you get scale. I'm sure Uber gets a break on list price, but I doubt they are anywhere near <150 employee subscription pricing.
You can cap per user, but without a rolling cap, are you really just going to tell a member of your team “No AI for the rest of the month”?
It’s a risky deal as it's set up now IMO.
Years ago I did work for a company that was spending over a million on Oracle product licenses and I was part of the consultant team they hired to rip it all out and just go for simple maintainable code based on open source products. Not only did it transform into a codebase that the average newly hired developer could maintain, you also had the savings of not paying Oracle a significant portion of your revenue.
I feel like that will repeat itself in a few years time with the current cloud and AI train everyone is on.
I haven't been in a professional setting for a while, I just code for fun nowadays so perhaps I'm somewhat out of the loop.
They gave up on self-driving, so that's not it.
If only. The optimizations they do on their matching algorithm have made the UX so terrible, I regularly use Lyft instead now.
Here's a much better article: https://aimagazine.com/news/why-uber-has-already-burned-thro...
Yes, productivity implies revenue (or cost reduction), and revenue is measurable.
However:
1. You spend money today to build features that drive revenue in the future, so when expenses go up rapidly today, you don’t yet have the revenue to measure.
2. It’s inherently a counterfactual consideration: you have these features completed today, using AI. You’re profitable/unprofitable. So AI is productive/unproductive, right? No. You have to estimate what you would’ve gotten done without AI, and how much revenue you would’ve had then.
3. Business is often a Red Queen’s race. If you don’t make improvements, it’s often the case that you’ll lose revenue, as competitors take advantage.
4. Most likely, AI use is a mixture of working on things that matter and people throwing shit against the wall “because it’s easy now.” Actually measuring the potential productivity improvements means figuring out how to keep the first category and avoid the second.
This isn’t me arguing for or against AI. It’s just me telling you not to be lazy and say “if it were productive you’d be able to measure it.”
Totally, but new features in their app or better software are not going to increase Uber's revenue/profit significantly.
I think the prevailing (correct) consensus is that developer productivity is actually very hard to measure, and every time it is attempted the measure is immediately made a target making the whole thing pointless even if it had been a solid measurement- which it wasn't.
IDK where you're getting the idea here that measuring productivity of anyone who isn't a factory worker is easy.
See the second comment on this article. https://news.ycombinator.com/item?id=47976781
See @emp17344 responding to me.
We doubt the productivity because we have enough experience with Claude Code to know that flooding your organization with that many tokens isn't just unproductive, it's actively harmful.
That's...not exactly a lot per engineer. It sounds like they just didn't budget correctly. Especially if the net of that work is more features that would have otherwise required hiring more engineers, which would cost a lot more than $500 to $2000 a month.
And I'm not talking about some genius 10x developer who is working with multiple git worktrees on x tasks in parallel at high quality.
That's a bit of a logical leap with no demonstrable increase in productivity.
All this shows is that they're spending a lot more on AI than they budgeted for. Nothing else.
You get what you measure.
Surprised Pikachu moment.
And it's going to become even more expensive when AI companies start charging to actually make a profit.
https://investor.uber.com/news-events/news/press-release-det...
Or did the engineers just chill and let Claude take over daily duties? (This is also a benefit for employees in my opinion.)
Also wonder if there is some perverse incentive for models to be verbose to juice tokens.
Successfully burning through cash and tokens, alright, but what have they gotten out of it?
I wonder how this will end as AI becomes more expensive to use. If you can't quantify ROI then I guess you're cooked.
As a founder, the question I always have is "what is the marginal value per token relative to engineer-hours saved." More of a gut feel at the moment, but would be great to calculate.
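One rough way to turn that gut feel into a number is a break-even calculation: how many engineer-hours does a given token bill need to save before it pays for itself? A minimal sketch; every input below (token volume, blended price per million tokens, loaded hourly cost) is a placeholder assumption rather than real pricing or real data:

```python
def token_cost_usd(tokens_used: int, usd_per_million_tokens: float) -> float:
    """Cost of a token volume at a blended per-million-token price."""
    return tokens_used / 1_000_000 * usd_per_million_tokens


def break_even_hours(
    tokens_used: int, usd_per_million_tokens: float, loaded_usd_per_hour: float
) -> float:
    """Engineer-hours that must be saved for the token bill to pay for itself."""
    return token_cost_usd(tokens_used, usd_per_million_tokens) / loaded_usd_per_hour


def net_value_usd(
    tokens_used: int,
    usd_per_million_tokens: float,
    hours_saved: float,
    loaded_usd_per_hour: float,
) -> float:
    """Labor cost avoided minus token cost; negative means the tokens lost money."""
    saved = hours_saved * loaded_usd_per_hour
    return saved - token_cost_usd(tokens_used, usd_per_million_tokens)


# Placeholder inputs: 50M tokens at $10 per million, $100/hour loaded cost.
hours_needed = break_even_hours(50_000_000, 10.0, 100.0)
net = net_value_usd(50_000_000, 10.0, hours_saved=8, loaded_usd_per_hour=100.0)
print(hours_needed, net)
```

The hard input is `hours_saved`, which is a counterfactual estimate. The sketch only makes explicit how sensitive the answer is to it.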
They are using it to mean a mechanism that produces prodigious amounts of toxic waste. That does not conform to the historical understanding of the word.
How are they calculating that? They could be using my tool, Buildermark, but I don't think they are: https://buildermark.dev
... but the key fact about "$500-$2000" per engineer does not appear there, and seems to be fabricated.
I’ve been using all these tools since they started popping out around 2021 personally and professionally. I probably built four or five products at this point with assistance, not to mention the thousands and thousands of back-and-forth conversations for research or search or rubber ducking or whatever.
I have never spent more than whatever the professional plan costs, which is consistently $20 a month.
I asked a friend of mine who spent a couple hundred dollars in like a few hours how they did it. The answer was that they were basically getting these groups of agents stuck in a loop, and the agents were constantly just generating verbose bullshit that is not even interrogated and doesn’t come out with any artifact that is inspectable no matter how expert you are.
The couple of stories I have heard of these massive crazy spends are people literally just assuming these things can complete an entire human task in one shot, so they continue to hit the “spin the wheel” button until they get something closer to what they want.
But I’ve yet to see that actually work, and it actually flies in the face of every instruction guide or documentation or prompt engineering process that has been described over the last almost 5 years.