Claude Opus 4.1

(anthropic.com)

813 points | by meetpateltech 1 day ago

35 comments

  • qsort 1 day ago
    All three major labs released something within hours of each other. This anime arc is insane.
    • Etheryte 22 hours ago
      This is why you have PR departments. Being on top of the HN front page, news sites, etc matters a lot. Even if you can't be the first, it's important to dilute the attention as much as possible to reduce the limelight your competitors get.
      • paulryanrogers 18 hours ago
        "Prep the next three point releases now, but don't release any until I say so. None needs to be noticably better or even different, just has to have a higher number." -CEO of AI companies
      • andai 16 hours ago
        How do they know when it's time? Corporate espionage? Or do they just have the Next Thing queued up months in advance and ready to go?
    • x187463 1 day ago
      Given the GPT5 rumors, August is just getting started.
      • kridsdale3 23 hours ago
        Given the Gregorian Calendar and the planet's path through its orbit, August is just getting started.
        • rkuykendall-com 3 hours ago
        • MollyRealized 1 hour ago
          Given the majesty and nobility of HN commenters, augustness is just getting started.
        • tomrod 22 hours ago
          This legitimately made me chuckle.
        • wunderg 18 hours ago
          Good one, made my day
          • aitchnyu 11 hours ago
            No, the rotation of the Earth around its axis did so.
            • Onewildgamer 6 hours ago
              Technically, Earth's rotation gives us day and night. It doesn't move the calendar, which advances through Earth's orbital revolution.
              • selcuka 5 hours ago
                Exactly the GP's point (the rotation of the Earth "made their day").
      • teaearlgraycold 14 hours ago
        I'm all buckled up and ready for what's effectively GPT 4.6
    • ozgung 1 day ago
      What a time to be alive
    • tonyhart7 23 hours ago
      As if they wait for a competitor to go first, then launch at the same time to let the market decide which one is best.
      • torginus 21 hours ago
        I think this means that GPT5 is better - you can't launch a worse model after the competitor supersedes you - you have to show that you're in the lead even if it's just for a day.
        • rapind 19 hours ago
          Not sure that this is true. Are there a lot of people waiting anxiously to adopt the next model on the day of release and expecting some huge work advantage?
          • dnh44 18 hours ago
            If you’re using an LLM near the limits of what it can do then a small improvement in performance is noticeable.
          • rzz3 12 hours ago
            My coworkers/partners and I haven’t stopped talking about it for weeks. I’m one of them I guess, but we’ll see. The ARC graph I saw, if accurate, is really incredible.
          • azan_ 19 hours ago
            Absolutely.
    • qoez 16 hours ago
      There's so many leakers in every lab
      • goatlover 15 hours ago
        If only there were more leakers in the FBI or DOJ.
        • anonym29 14 hours ago
          It's a risky game to leak the secrets of the gang that has a legal monopoly on violence.
          • lan321 4 hours ago
            Slipped in the bathroom and hung himself on the shower curtains. Oh, what a shame.
          • euroderf 14 hours ago
            Sneakier is safer.
    • j45 16 hours ago
      They likely sit on releases ready to go.
    • candiddevmike 23 hours ago
      It's definitely a coincidence
      • wilg 23 hours ago
        It's not a coincidence or a cartel, it's PR counterprogramming.
        • brookst 5 hours ago
          Any source or just vibes?

          In my experience it takes weeks if not months to coordinate a release, from testing to documentation to drafting press releases in multiple languages to benchmarks and website updates.

          I’m old and I’ve been in this industry most of my life. I have never once seen or heard of all of that work being done and the company just waiting on competitors before pulling the trigger.

        • BudaDude 19 hours ago
          Agree 100%

          If you look at the past, whenever Google announces something major, OpenAI almost always releases something as well.

          People forget that OpenAI was started to compete with Google on AI.

      • subarctic 13 hours ago
        But is it just a coincidence
    • vFunct 23 hours ago
      None of them seems to have published any papers on how these new models advance the state of the art, though. =^(
      • hugodan 20 hours ago
        china will do that for them
    • PeterStuer 4 hours ago
      EU auto brands colluded for years to synchronize new tech into their model lines. Could it be the AI SaaS sector is showing its first steps towards "maturity"? /s
  • djha-skin 21 hours ago
    Opus 4(.1) is so expensive[1]. Even Sonnet[2] costs me $5 per hour (basically) using OpenRouter + Codename Goose[3]. The crazy thing is Sonnet 3.5 costs the same thing[4] right now. Gemini Flash is more reasonable[5], but always seems to make the wrong decisions in the end, spinning in circles. OpenAI is better, but still falls short of Claude's performance. Claude also gives back 400's from its API if you CTRL-C in the middle though, so that's annoying.

    Economics is important. Best bang for the buck seems to be OpenAI's GPT-4.1 mini[6]. Does a decent job, doesn't flood my context window with useless tokens like Claude does, and the API works every time. Gets me out of bad spots. Can get confused, but I've been able to muddle through with it.

    1: https://openrouter.ai/anthropic/claude-opus-4.1

    2: https://openrouter.ai/anthropic/claude-sonnet-4

    3: https://block.github.io/goose/

    4: https://openrouter.ai/anthropic/claude-3.5-sonnet

    5: https://openrouter.ai/google/gemini-2.5-flash

    6: https://openrouter.ai/openai/gpt-4.1-mini

    • generalizations 21 hours ago
      Get a subscription and use claude code - that's how you get actual reasonable economics out of it. I use claude code all day on the max subscription and maybe twice in the last two weeks have I actually hit usage limits.
      • drusepth 11 minutes ago
        Is there any documentation on what the Max sub usage limit is? A coworker tried it and was booted off Opus within just a couple of hours due to "high usage". I haven't made the jump since I expect my $3k/mo on the API would instantly blow past a $200/mo sub and then I'd just be back on the API again, but if it could carve out $1k-2k of costs in exchange for a bit of time managing subscription(s), it might be worth it.
      • teruakohatu 15 hours ago
        > Get a subscription and use claude code

        I find the token/credit restrictions on Opus make it near useless even when using Claude Code. I only ever switch to it to get another model's take on the issue. Five minutes of use and I have hit the limit.

        • ygouzerh 8 hours ago
          It seems that for Opus, the Max plan is almost always needed for it to be useful.
        • closewith 11 hours ago
          Is it a max subscription?

          We have the $200 plans for work and despite only using Opus, we rarely hit the limits. CCUsage suggests the same via API would have been ~$2000 over the last month (we work 5 hours a day, 4 days a week, almost always with Claude).

          • ngai_aku 4 hours ago
            Are you part time?
            • closewith 2 hours ago
              In a way. Those are my company's working hours.
        • Tostino 13 hours ago
          Yup. Getting to try three or so prompts that it messes up and then running out of quota for hours is entirely useless to me.
      • tgtweak 21 hours ago
        Is it considerably more cost effective than cline+sonnet api calls with caching and diff edits?

        Same context length and throughput limits?

        Anecdotally I found GPT-4.1 (and mini) pretty good at those agentic programming tasks, but the lack of token caching made the costs blow up with long context.

        • MarcelOlsz 17 hours ago
          If you use Claude Code with a subscription and run `ccusage` [0] you can get an idea of your "true usage" and cost.

          [0] https://github.com/ryoppippi/ccusage

        • bavell 20 hours ago
          I'm on the basic $20/mo sub and only ran into token cap limitations in the first few days of using Claude Code (now 2-3 weeks in) before I started being more aggressive about clearing the context. Long contexts will eat up token caps quickly when you are having extended back-and-forth conversations with the model. Otherwise, it's been effectively "unlimited" for my own use.
          • bgirard 19 hours ago
            YMMV I'm using the $100/mo max subscription and I hit the limit during a focused coding session where I'm giving it prompts non-stop.

            Unfortunately there's no easy tool to inspect usage. I started a project to parse the Claude logs using Claude and generate a Chrome trace with it. It's promising but it was taking my tokens away from my core project.

            • bartman 19 hours ago
              Check out ccusage, it sounds like the tool you’re describing: https://github.com/ryoppippi/ccusage
              • bgirard 19 hours ago
                That's neat. According to the tool I'm consuming ~300m tokens per day coding with a (retail?) cost of ~$125/day. The output of the model is definitely worth $100/mo to me.
                • j45 3 hours ago
                  This is a good bar to know. I see the warnings but I'm not sure how much I really have left.

                  Do you mostly use opus?

              • j45 3 hours ago
                Neat tool thanks!
            • symbolicAGI 19 hours ago
              ccusage on GitHub.
        • j45 16 hours ago
          Yes, it’s much better.

          It uses far fewer tokens, or uses them much more effectively, when running locally.

      • seneca 19 hours ago
        Is there a way to sign up for Claude code that doesn't involve verifying a phone number with Anthropic? They don't even accept Google Voice numbers.

        Maybe I'm out of touch, but I'm not handing out my phone number to sign up for random SaaS tools.

        • cma 15 hours ago
          It's maybe the leading subscription based tool in our field, not a random SaaS tool.
          • what 13 hours ago
            They have zero need for a phone number.
            • senko 9 hours ago
              There are a lot of fraudsters out there who will happily create thousands of accounts with valid CCs that will fail on first actual charge.[0]

              I wouldn't be surprised if asking for a phone number lowers the fraud rate enough to compensate for the added friction.

              [0] Incidentally, this is also why many AI API providers ask for your money upfront (buy credits) unless you're big enough and/or have an existing relationship with them.

            • eddythompson80 10 hours ago
              Come on now. You're about to run their cli and let it send any random file on your machine to their API intentionally. Trust them a little.
        • tagami 19 hours ago
          use a burner
    • killerstorm 6 hours ago
      Well, it's expensive compared to other models. But it's often much cheaper than human labor.

      E.g. if you need a self-contained script to do some data processing, Opus can often do that in one shot. A 500-line Python script would cost around $1, and as long as it's not tricky it just works - you don't need back-and-forth.

      I don't think it's possible to employ any human to write a 500-line Python script for $1 (unless it's a free intern or a student), let alone do it in one minute.

      Of course, if you use LLM interactively, for many small tasks, Opus might be too expensive, and you probably want a faster model anyway. Really depends on how you use it.

      (You can do quite a lot in file-at-once mode. E.g. Gemini 2.5 Flash could write 35 KB of code for a full ML experiment in Python - self-contained with data loading, model setup, training, evaluation, all in one file, pretty much on the first try.)
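
      As a rough back-of-the-envelope check on the "around $1" figure, here is a minimal sketch in Python; the per-token prices and the tokens-per-line ratio are assumptions for illustration, not quotes from Anthropic's price list:

      ```python
      # Rough cost estimate for one-shotting a ~500-line Python script with Opus.
      # All figures below are assumptions for illustration only.
      INPUT_PRICE_PER_MTOK = 15.00    # assumed Opus input price, USD per million tokens
      OUTPUT_PRICE_PER_MTOK = 75.00   # assumed Opus output price, USD per million tokens

      prompt_tokens = 1_500           # assumed: a detailed prompt plus a bit of context
      output_tokens = 500 * 12        # assumed ~12 tokens per generated line of code

      cost = (prompt_tokens / 1e6) * INPUT_PRICE_PER_MTOK \
           + (output_tokens / 1e6) * OUTPUT_PRICE_PER_MTOK
      print(f"~${cost:.2f} per one-shot attempt")   # ~$0.47 with these assumptions
      ```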

    • Aeolun 17 hours ago
      In every price comparison I make, Claude (API) always comes out cheapest if you manage to keep most of your context cached. The 90% price reduction for cached input is crazy.
      • cma 15 hours ago
        Cached prices: $0.31/MTok for Gemini Pro, $1.50/MTok for Claude Opus 4.1.

        There are additional storage costs with Google caching, around $3.75 per MTok per 5 minutes, and Claude Opus is $3.75/MTok for 5-minute cache writes.

        For cached reads Gemini Pro is 5X cheaper than Opus and like $0.01 more than Sonnet.
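
        Working from those numbers, a quick sketch in Python (taking the per-MTok figures above at face value):

        ```python
        # Cached-read prices quoted above, USD per million tokens (assumed accurate).
        gemini_pro_cached = 0.31
        opus_41_cached = 1.50

        print(f"Opus cached reads cost {opus_41_cached / gemini_pro_cached:.1f}x Gemini Pro's")
        # -> ~4.8x, i.e. roughly the "5X cheaper" figure for Gemini
        ```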

    • energy123 17 hours ago
      Large models are for querying the model

      Small models are for querying the context

      Opus is cheap if you use it for its niche

      • thimabi 16 hours ago
        > Large models are for querying the model

        > Small models are for querying the context

        I respectfully disagree.

        My experience is that large models are capable of understanding large contexts much better. Of course they are more expensive and slower, too. But in terms of accuracy, large models are always better at querying the context.

    • kroaton 20 hours ago
      GLM 4.5 / Kimi K2 / Qwen Coder 3 / Gemini Pro 2.5
  • jzig 1 day ago
    I'm confused by how Opus is presented to be superior in nearly every way for coding purposes yet the general consensus and my own experience seem to be that Sonnet is much much better. Has anyone switched to entirely using Opus from Sonnet? Or maybe switching to Opus for certain things while using Sonnet for others?
    • SkyPuncher 1 day ago
      I don't doubt Opus is technically superior, but it's not practically superior for me.

      It's still pretty much impossible to have any LLM one-shot a complex implementation. There's just too many details to figure out and too much to explain for it to get correct. Often, there's uncertainty and ambiguity that I only understand the correct answer (or rather less bad answer) after I've spent time deep in the code. Having Opus spit out a possibly correct solution just isn't useful to me. I need to understand _why_ we got to that solution and _why_ it's a correct solution for the context I'm working in.

      For me, this means that I largely have an iteratively driven implementation approach where any particular task just isn't that complex. Therefore, Sonnet is completely sufficient for my day-to-day needs.

      • bdamm 20 hours ago
        I've been having a great time with Windsurf's "Planning" feature. Have a nice discussion with Cascade (Claude) all about what it is that needs to happen - sometimes a very long conversation including test code. Then when everything is very clear, make it happen. Then test and debug the results with all that context. Pretty nice.
        • SkyPuncher 3 hours ago
          This is basically what I do. I have a specific "planning mode" prompt I work through.

          It's very, very helpful. However, there are still a lot of problems I only discover/figure out after I've been working in the code.

        • jstummbillig 20 hours ago
          Can you explain what you do exactly? Do you enable plan mode and use with chat...?
          • trenchpilgrim 18 hours ago
            In Zed I switch the AI panel to ask mode and chat with the agent about different approaches and have it draft patches. Then when I think there's a design worth trying, switch to Write mode and have it implement that change + run the tests and diagnostics to verify the code at least compiles, tests pass and follows our style guides. Finally a line by line review + review of the test coverage (in terms of interface surface area) before submitting a PR for another human review.
            • Larrikin 17 hours ago
              After watching a few videos trying to understand how people were using LLMs and getting useful results, I found that even making a simpler version of the fancy planning mode in the LLM IDEs via an instructions.md produced huge productivity gains.

              I started adding an instruction file along the lines of "Always tell me your plan to solve the issue first with short example code, never edit files without explicit confirmation of your plan" at the start, and it is like a night and day difference in how useful it becomes. It also starts to feel like programming again, where you can read through various files and, instead of thinking in your head, you write out your thoughts. You end up getting confirmation or pushback on errors that you can clean up.

              Reading through a sort-of-wrong, sort-of-right implementation spread across various files after every prompt just really sucked.

              I'm not one shotting massive amounts of files, but I am enjoying the lack of grunt work.

              • anjanb 13 hours ago
                Could you share some of the videos that you watched? Could you make a video yourself? That would help a lot of us.
      • ssk42 22 hours ago
        You can also always have it create design docs and Mermaid diagrams for each task. It makes it much easier to outline the why earlier, shifting left.
        • SkyPuncher 3 hours ago
          That's essentially what I do, but that doesn't (and cannot) entirely solve the problem.

          A major part of software engineering is identifying and resolving issues during implementation. Plans are a good outline of what needs to be done, but they're always incomplete and inaccurate.

    • adastra22 1 day ago
      Every time that Sonnet is acting like it has brain damage (which is once or twice a day), I switch to Opus and it seems to sort things out pretty fast. This is unscientific anecdata though, and it could just be that switching models (any model) would have worked.
      • gpm 1 day ago
        This seems like a case of reversion to the mean. When one model is performing below average, changing anything (like switching to another model) is likely to improve it by random chance...
        • keeeba 21 hours ago
          Anthropic say Opus is better, benchmarks & evals say Opus is better, Opus has more parameters and parameters determine how much a NN can learn.

          Maybe Opus just is better

          • 8n4vidtmkvmk 13 hours ago
            Even if it's better on average, doesn't mean it's better for every possible query
      • monatron 1 day ago
        This is a great use case for sub-agents IMO. By default, sub-agents use sonnet. You can have opus orchestrate the various agents and get (close to) the best of both worlds.
        • ondrsh 9 hours ago
          AFAIK subagents inherit the default model since v1.0.64. At least that's the case for me with the Claude Code SDK — not providing a specific model makes subagents use claude-opus-4-1-20250805.
        • rapind 19 hours ago
          In this case I don't think the controller needs to be the smartest model. I use sonnet as the main driver and pass the heavy thinking (via zen mcp) onto Gemini pro for example, but I could use openai or opus or all of them via OpenRouter.

          Subagents seem pretty similar to using zen mcp w/ OpenRouter but maybe better or at least more turnkey? I'll be checking them out.

          • mark_undoio 18 hours ago
            Amp (ampcode.com) uses Sonnet as its main model and has OpenAI's o3 as a special-purpose tool / subagent. It can call into that when it needs particularly advanced reasoning.

            Interestingly I found that prompting it to ask the o3 submodel (which they call The Oracle) to check Sonnet's working on a debugging solution was helpful. Extra interesting to me was the fact that Sonnet appeared to do a better job once I'd prompted that (like chain of thought prompting, perhaps asking it to put forward an explanation to be checked actually triggered more effective thinking).

        • adastra22 23 hours ago
          Is there a way to get persistent sub-agents? I'd love to have a bunch of YAML files in my repository, one for each sub-agent, and have those automatically used across all Claude Code instances I have on multiple machines (I dev on laptop and desktop), or across the team.
          • theshrike79 9 hours ago
            In my experience the best use for subagents is saving context.

            Example: you need to review some code to see if it has proper test coverage.

            If you use the "main" context, it'll waste tokens on reading the codebase and running tests to see coverage results.

            But if you launch an agent (a subprocess pretty much), it can use a "disposable" context to do that and only return with the relevant data - which bits of the code need more tests.

            Now you can either use the main context to implement the tests or if you're feeling really fancy launch another sub-agent to do it.
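
            For the question above about keeping persistent sub-agents in the repo: my understanding is that Claude Code picks up sub-agent definitions from Markdown files with YAML frontmatter checked into .claude/agents/, roughly like the sketch below (the path, field names, and tool list are from memory and worth verifying against Anthropic's docs):

            ```markdown
            ---
            name: test-coverage-reviewer
            description: Reviews code for test coverage gaps. Use after making changes.
            tools: Read, Grep, Glob, Bash
            ---

            You are a test-coverage reviewer. Inspect the code under discussion, run the
            test suite with coverage enabled, and report only the files and functions
            that need additional tests. Do not edit any files.
            ```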

          • mwigdahl 22 hours ago
        • riwsky 16 hours ago
          Great, now even computers need to leave the IC track if they want continued career progression.
      • HarHarVeryFunny 23 hours ago
        Maybe context rot? If the model's output seems to be getting worse or in a rut, then try just clearing the context / starting a new session.
        • adastra22 23 hours ago
          Switching models with the same context, in this case.
      • aghilmort 15 hours ago
        Switching models is a great best practice whether you get stuck or not.

        You can look at it as the primal (checking the mean) or the dual (getting out of local minima).

        In all cases, the model, tokenizer, etc. are just different enough that switching will generally pay off quickly.

      • j45 1 day ago
        They both seem to behave differently depending on how loaded the system seems to be.
        • api 1 day ago
          I have suspected for a long time that hosted models load shed by diverting some requests to lesser models or running more quantized versions under high load.
          • parineum 1 day ago
            I think OpenRouter saves tokens by summarizing queries through another model, IIRC.
      • anonzzzies 1 day ago
        Exactly that.
    • Uehreka 1 day ago
      > yet the general consensus and my own experience seem to be that Sonnet is much much better

      Given that there’s nothing close to scientific analysis going on, I find it hard to tell how big the “Sonnet is overall better, not just sometimes” crowd is. I think part of the problem is that “The bigger model is better” feels obvious to say, so why say it? Whereas “the smaller model is better actually” feels both like unobvious advice and also the kind of thing that feels smart to say, both of which would lead to more people who believe it saying it, possibly creating the illusion of consensus.

      I was trying to dig into this yesterday, but every time I come across a new thread the things people are saying and the proportions saying what are different.

      I suppose one useful takeaway is this: If you’re using Claude Max and get downgraded from Opus to Sonnet for a few hours, you don’t have to worry too much about it being a harsh downgrade in quality.

    • MostlyStable 1 day ago
      Opus seems better to me on long tasks that require iterative problem solving and keeping track of the context of what we have already tried. I usually switch to it for any kind of complicated troubleshooting etc.

      I stick with Sonnet for most things because it's generally good enough and I hit my token limits with it far less often.

      • unshavedyak 1 day ago
        Same. I'm on the $200 plan and I find Opus "better", but Sonnet is more straightforward. Sonnet is, to me, a "don't let it think" model. It does great if you give it concrete and small goals. Anything vague or broad and it starts thinking and it's a problem.

        Opus gives you a bit more rope to hang yourself with imo. Yes, it "thinks" slightly better, but still not well enough for me. But it can be good enough to convince you that it can do the job... so I dunno, I almost dislike it for that. I find Sonnet just easier to predict in this regard.

        Could I use Opus like I do Sonnet? Yes definitely, and generally I do. But then I don't really see much difference since I'm hand-holding so much.

    • jm4 23 hours ago
      I use both. Sonnet is faster and more cost efficient. It's great for coding. Where Opus is noticeably better is in analysis. It surpasses Sonnet for debugging, finding patterns in data, creativity and analysis in general. It doesn't make a lot of sense to use Opus exclusively unless you're on a max20 plan and not hitting limits. Using Opus for design and troubleshooting and Sonnet for everything else is a good way to go.
    • biinjo 1 day ago
      I'm on the Max plan and generally Opus seems to do better work than Sonnet. However, that's only when they allow me to use Opus. The usage limits, even on the Max plan, are a joke. Yesterday I hit the limits within MINUTES of starting my work day.
      • furyofantares 23 hours ago
        I'm a bit confused by people hitting usage limits so quickly.

        I use Opus exclusively and don't hit limits. ccusage reports I'm using the API-equivalent of $2000/mo

        • rirze 23 hours ago
          You always have to ask which plan they're paying for. Sometimes people complain about the $20 per month plan...
          • stavros 23 hours ago
            There's no Opus quota on that plan at all.
          • furyofantares 23 hours ago
            In this case I'm replying to someone who led with "I'm on the Max plan" but I realize now that's ambiguous, maybe they are on 5x while I'm on 20x.
        • Bolwin 22 hours ago
          That's insane. Are you accounting for caching? If not, there's no way this is going to last
          • furyofantares 22 hours ago
            I'm using ccusage to get the number, I think it just looks at your history and calculates based on tokens vs API pricing. So I think it wouldn't account for caching.

            But I totally agree there's no way it lasts. I'm mostly only using this for side projects and I'm sitting there interacting with it, not YOLO'ing, I do sometimes have two sessions going at the same time but I'm not firing off swarms or anything crazy. Just have it set to Opus and I chat with it.

            • Aeolun 17 hours ago
              Claude Code definitely reports cached tokens, and I think CCusage does too, so it wouldn’t make sense for the calculation to be based on full pricing when they have the cached values.
      • Aeolun 17 hours ago
        Is this on 5x? Because ever since they booted all the freeloaders I've not once seen the "you are approaching usage limits" message. Anyway, the "you are approaching usage limits" message shows up when you are over 50% of your tokens for that timeframe, so it's not super useful.
      • epolanski 1 day ago
        Yeah, you need to actively cherry pick which model to use in order to not waste tokens on stuff that would be easily handled by a simpler model.
      • dsrtslnd23 21 hours ago
        Same here, I constantly hit the Opus limits after minutes on the Max plan.
    • dested 1 day ago
      If I'm using cursor then sonnet is better, but in claude code Opus 4 is at least 3x better than Sonnet. As with most things these days, I think a lot of it comes down to prompting.
      • jzig 1 day ago
        This is interesting. I do use Cursor with almost exclusively Sonnet and thinking mode turned on. I wonder if what Cursor does under the hood (like their indexing) somehow empowers Sonnet more. I do not have much experience with using Claude Code.
    • datameta 1 day ago
      I now eagerly await Sonnet 4.1, only because of this release.
    • chisleu 11 hours ago
      I use opus or gemini 2.5 pro for plan mode and sonnet for act mode in Cline. https://cline.bot

      It's my experience that Opus is better at solving architectural challenges where sonnet struggles.

    • sothatsit 18 hours ago
      Opus really shines for completing long-running tasks with no supervision. But if you are using Claude Code interactively and actively steering it yourself, Sonnet is good enough and is faster.

      I don't believe anyone saying Sonnet yields better results than Opus though, as my experience has been exactly the opposite. But trade-off wise, I can definitely see it being a better experience when used interactively because of its speed and lower cost.

    • astrostl 23 hours ago
      With aggressive Claude Code use I didn't find Sonnet better than Opus but I did find it faster while consuming far fewer tokens. Once I switched to the $100 Max plan and configured CC to exclusively use Sonnet I haven't run into a plan token limit even once. When I saw this announcement my first thing was to CMD-F and see when Sonnet 4.1 was coming out, because I don't really care about Opus outside of interactive deep research usage.
    • tkz1312 8 hours ago
      100% opus all the time. Sonnet seems to get confused much faster and need more hand holding in my experience.
    • Aeolun 17 hours ago
      My opinion of Opus is that it takes the correct action 19/20 times, where Sonnet takes the correct action 18/20 times. It’s not strictly necessary to use Opus, but if you have the subscription already it’s just a pure win.
    • sky2224 18 hours ago
      I've found with limited context provided in your prompt, opus is just awful compared to even gpt-4.1, but once I give it even just a little bit more of an explanation, it jumps leagues ahead.
    • gpm 1 day ago
      I notice that on the "Agentic Coding" benchmark cited in the article, Sonnet 4 outperformed Opus 4 (by 0.2%) and underperforms Opus 4.1 (by 1.8%).

      So this release might change that consensus? If you believe the benchmarks are reflective of reality anyways.

      • jimbo808 1 day ago
        > If you believe the benchmarks are reflective of reality anyways.

        That's a big "if." But yeah, I can't tell a difference subjectively between Opus and Sonnet, other than maybe a sort of placebo effect. I'm more careful to write quality prompts when using Opus, because I don't want to waste the 5x more expensive tokens.

    • j45 3 hours ago
      Opus is superior at understanding the big picture and the direction.

      Sonnet is great at banging it out.

    • cmrdporcupine 4 hours ago
      A strategy I'm playing with (we'll see how good the results are) is to prompt Opus to analyze and plan but not implement.

      E.g. prompt it to read a paper, read some source, then write out a terse document meant to be read by a machine, not a human.

      Then switch to Sonnet, have it read that document, and do the actual implementation work.

    • brenoRibeiro706 1 day ago
      I feel the same way. I usually use Opus to help with coding and documentation, and I use Sonnet for emails and so on.
    • rtfeldman 1 day ago
      Yes, Opus is very noticeably better at programming in both Rust and Zig in my experience. I wish it were cheaper!
    • seunosewa 1 day ago
      It's ridiculously overpriced in the API. Just like o3 used to be.
    • taormina 1 day ago
      Just more anecdata, but I entirely agree. I can't say that I am happy with Sonnet's output at any point, really, but it still occasionally works, whereas Opus has been a dumpster fire every single time.
    • ssss11 20 hours ago
      That’s very strange. Sonnet is hot garbage and Opus is a miracle, for me. I also don’t see anyone praising sonnet anywhere.
  • gusmally 1 day ago
    They restarted Claude Plays Pokemon with the new model: https://www.twitch.tv/claudeplayspokemon

    (He had been stuck in the Team Rocket hideout (I believe) for weeks)

    • Lionga 4 hours ago
      The finest of AI, probably using the electricity/water of hundreds of homes, cannot even beat a very simple children's game with millions of text guides etc. about it.

      When can we replace doctors with it?

  • taormina 1 day ago
    Alright, well, Opus 4.1 seems exactly as useless as Opus 4 was, but it's probably eating my tokens faster. I wish there were some way to tell.

    At least Sonnet 4 is still usable, but I'll be honest, it's been producing worse and worse slop all day.

    I've basically wasted the morning on Claude Code when I should've just been doing it all myself.

    • AlecSchueler 20 hours ago
      I've also noticed Sonnet starting to degrade. It's developing some of the behaviours that put me off the competition in the first place. Needless explanations, filler in responses, wanting to put everything in lists, even increased sycophancy.
      • ACCount36 4 hours ago
        Major AI companies are not doing nearly enough to address the sycophancy problem.

        I get that it's not an easy problem to solve, but how is Anthropic supposed to solve the actual alignment problem if they can't even stop their production LLMs from glazing the user all the time? And OpenAI is somehow even worse.

    • Aeolun 17 hours ago
      I feel like this is just related to my projects getting bigger. Claude Code is trying to keep up with my project evolving from 2k lines of code to 100k lines. Of course it’s going to feel worse.
      • satyrun 5 hours ago
        I think it is how our expectations of the latest model change over time.

        I expect to be completely blown away by GPT-5 in the first few days and then over time I will figure out the limitations of the model. Then I will be less impressed because you don't know what it can't do at first.

      • taormina 16 hours ago
        My project is basically the same size as when I started using it.
    • bavell 20 hours ago
      > I've basically wasted the morning on Claude Code when I should've just been doing it all myself.

      Welcome to the machine

      https://www.youtube.com/watch?v=tBvAxSx0nAM&t=45s

    • UncleEntity 17 hours ago
      Other than it starting out trying to produce a full and complete web app (or whatever) for my daily yak-shaving session instead of the normal "let's talk about and work through this thing", the new Opus 4.1 seems to 'get it' a lot quicker than the old daffy robot did. It asked pertinent questions to understand the system we are working on and accomplished the goal of updating the design document, so I don't have to keep explaining details at the start of every chat session. Something, by the way, it always previously failed to do, causing me to have to explain stuff each and every time before forward progress could be made.

      I do agree it did hit the token limit a lot quicker than before where I could chat for hours without worrying about it.

      Either way, still have one last yak to shave for this project so we'll see how efficient it is with that. If it accomplishes the task before burning through all the tokens then win, win, I suppose.

  • thoop 23 hours ago
    The article says "We plan to release substantially larger improvements to our models in the coming weeks."

    Sonnet 4 has definitely been the best model for our product's use case, but I'd be interested in trying Haiku 4 (or 4.1?) just due to the cost savings.

    I'm surprised Anthropic hasn't mentioned anything about Haiku 4 yet since they released the other models.

  • haaz 1 day ago
    It is barely an improvement according to their own benchmarks. Not saying that's a bad thing, but it's not enough for anybody to notice any difference.
    • waynenilsen 1 day ago
      I think it's probably mostly vibes, but that still counts; this is not in the charts:

      > Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.

      • esafak 13 hours ago
        That is a big improvement.
    • ttoinou 1 day ago
      That's why they named it 4.1 and not 4.5
      • zamadatix 1 day ago
          When it's "that's why they incremented the version by a tenth instead of a half", you know things have really started to slow down for the large models.
        • phonon 23 hours ago
          Opus 4 came out 10 weeks ago. So this is basically one new training run improvement.
          • zamadatix 23 hours ago
            And in 52 weeks we've gone 3.5->4.1 with this training improvement, meanwhile the 52 weeks prior to that were Claude -> Claude 3. The absolute jumps per version delta also used to be larger.

            I.e. it seems we don't get much more than new training run levels of improvement anymore. Which is better than nothing, but a shame compared to the early scaling.

            • globalise83 21 hours ago
              Is it really a bigger jump to go from plausible to frequently useful, than from frequently useful to indispensable?
              • zamadatix 20 hours ago
                Why is there supposed to be no step between frequently useful and indispensable? Quickly going from nothing to frequently useful (which involved many rapid hops between) was certainly surprising, and that's precisely the lost momentum.
        • mclau157 23 hours ago
          They released this because competitors are releasing things
    • gloosx 23 hours ago
      They need to leave some room to release 10 more models. They could crank benchmarks to 100%, but then no new model is needed lol? Pretty sure these pretty benchmark graphs are all completely staged marketing numbers, since they solve the same problems they are being trained on – no novel or unknown problem is presented to them.
    • Topfi 21 hours ago
      I am still very early, but output quality wise, yes, there does not seem to be any noticeable improvement in my limited personal testing suite. What I have noticed though is subjectively better adherence to instructions and documentation provided outside the main prompt, though I have no way to quantify or reliably test that yet. So beyond reliably finding Needles-in-the-Haystack (which Frontier models have done well on lately), Opus 4.1 seems to do better in following those needles even if not explicitly guided to compared to Opus 4.
    • onlyrealcuzzo 21 hours ago
      I will only add that it's interesting that in the results graphic, they simply highlighted Opus 4.1 - choosing not to display which models have the best scores - as Opus 4.1 only scored the best on about half of the benchmarks - and was worse than Opus 4.0 on at least one measure.
    • levocardia 22 hours ago
      "You pay $20/mo for X, and now I'm giving you 1.05*X for the same price." Outrageous!
    • leetharris 1 day ago
      Good! I'm glad they are just giving us small updates. Opus 4 just came out, if you have small improvements, why not just release them? There's no downside for us.
    • AstroBen 1 day ago
      I don't think this could even be called an improvement? It's small enough that it could just be random chance
      • j_bum 1 day ago
        I’ve always wondered about this actually. My assumption is that they always “pick the best” result from these tests.

        Instead, ideally they’d run the benchmark tests many times, and share all of the results so we could make statistical determinations.

  • steveklabnik 1 day ago
    This is the bit I'm most interested in:

    > We plan to release substantially larger improvements to our models in the coming weeks.

  • TimMeade 19 hours ago
    This has been the worst Claude day ever. It just fell apart. Not sure if the release is why, but it's cursing in documents and cannot fix a bug after hours of back and forth.
    • octo888 13 hours ago
      You're prompting it wrong!
  • tecleandor 8 hours ago
    Is there any tool like Claude Code that can go into the same "automatic feedback and coding loop" (I don't know if it has an official name) but is compatible with different LLMs?

    I've used Aider for a while, and I kind of liked it, but it felt like it needed way more manual work, and I also want to use different models, probably locally hosted. I haven't used Aider in 2 or 3 months, so I don't know if it has already evolved in that way...

    edit: on the other hand, the automatic feedback loop means it sometimes goes very crazy and the API costs skyrocket easily. But maybe that's another reason to run it locally.

    • mitemte 4 hours ago
      Claude Code Router (https://github.com/musistudio/claude-code-router) lets you use Claude Code with other, non-Anthropic models.
    • bananapub 5 hours ago
      opencode, but note that all the self-hosted LLMs are much worse at coding than Claude Code with Opus/Sonnet.

      There's also claude-code-proxy to make Claude Code use other models.

      • ptrrrrrrppr 4 hours ago
        Claude code proxy technically works with OpenAI, but tool use breaks every now and then on o3-mini, making it unusable for me.
  • minimaxir 1 day ago
    This likely won't move the needle for Opus use over Sonnet while the cost remains the same. Using OpenRouter rankings (https://openrouter.ai/rankings) as a proxy, Sonnet 3.7 and Sonnet 4 combined generate 17x more tokens than Opus 4.
  • ryandrake 23 hours ago
    Am I the only one super confused about how to even get started trying out this stuff? Just so I wouldn't be "that critic who doesn't try the stuff he criticizes," I tried GitHub Copilot and was kind of not very impressed. Someone on HN told me Copilot sucks, use Claude. But I have no idea what the right way to do it is because there are so many paths to choose.

    Let's see: we have Claude Code vs. Claude the API vs. Claude the website, and they're totally different from each other? One is command line, one integrates into your IDE (which IDE?) and one is just browser based, I guess. Then you have the different pricing plans, Free, Pro, and Max? But then there's also Claude Team and Claude Enterprise? These are monthly plans that only work with Claude the Website, but Claude Code is per-request? Or is it Claude API that's per-request? I have no idea. Then you have the models: Claude Opus and Claude Sonnet, with various version numbers for each?? Then there's Cline and Cursor and GOOD GRIEF! I just want to putz around with something in VSCode for a few hours!

    • AlecSchueler 23 hours ago
      I'm not sure what's complicated about what you're describing? They offer two models and you can pay more for higher usage limits, then you can choose if you want to run it in your browser or in your terminal. Like what else would you expect?

      Fwiw I have a Claude pro plan and have no interest in using other offerings so I'm not sure if they're super simple (one model, one interface, one pricing plan)?

      • onlyrealcuzzo 23 hours ago
        When people post this stuff, it's like, are you also confused that Nike sells shoes AND shorts AND shirts, and there's different colors and skus for each article of clothing, and sometimes they sell direct to consumer and other times to stores and to universities, and also there's sales and promotions, etc, etc?

        It's almost as if companies sell more than one product.

        Why is this the top comment on so many threads about tech products?

        • furyofantares 22 hours ago
          In this case, they tried something and were told they were doing it wrong, and they know there's more than one way to do it wrong - wrong model, wrong tool using the model, wrong prompting, wrong task that you're trying to use it for.

          And of course you could be doing it right but the people saying it works great could themselves be wrong about how good it is.

          On top of that it costs both money and time/effort investment to figure out if you're doing it wrong. It's understandable to want some clarity. I think it's pretty different from buying shoes.

          • evilduck 22 hours ago
            > I think it's pretty different from buying shoes.

            Shoe shopping is pretty complex, more so than trialing an AI model in my opinion.

            Are you a construction worker, a banker, a cashier or a driver? Are you walking 5 miles everyday or mostly sedentary? Do you require steel toed shoes? How long are you expecting them to last and what are you willing to pay? Are you going to wear them on long runs or take them river kayaking? Do they need to be water resistant, waterproof or highly breathable? Do you want glued, welted, or stitch down construction? What about flat feet or arch support? Does shoe weight matter? What clothing are you going to wear them with? Are you going to be dancing with them? Do the shoes need a break in period or are they ready to wear? Does the available style match your preferences? What about availability, are you ok having them made to order or do you require something in stock now?

            By comparison I can try 10 different AI services without even needing to stand up for a break while I can't buy good dress shoes in the same physical store as a pair of football cleats.

            • kelnos 20 hours ago
              > Shoe shopping is pretty complex, more so than trialing an AI model in my opinion.

              Oh c'mon, now you're just being disingenuous, trying to make an argument for argument's sake.

              No, shoe shopping is not more complicated than trialing a LLM. For all of those questions about shoes you are posing, either a) a purchaser won't care and won't need to ask them, or b) they already know they have specific requirements and will know what to ask.

              With an LLM, a newbie doesn't even know what they're getting into, let alone what to ask or where to start.

              > By comparison I can try 10 different AI services without even needing to stand up for a break

              I can't. I have no idea how to do that. It sounds like you've been following the space for a while, and you're letting your knowledge blind you to the idea that many (most?) people don't have your experience.

              • yelirekim 7 hours ago
                It sounds like you're generally unfamiliar with using AI to help you at all? Or maybe you're also being disingenuous? It's insanely easy to figure this stuff out, I literally know a dozen people who are not even engineers, have no programming experience, who use these tools. Here's what Claude (the free version at claude.ai) said in response to me saying "i have no idea how to use AI coding assistants, can you succinctly explain to me what i need to do? like, what do i download, run, etc in order to try different models and services, what are the best tools and what do they do?":

                Here's a quick guide to get you started with AI coding assistants:

                ## Quick Start Options (Easiest)

                *1. Web-based (Nothing to Download)*
                - *Claude.ai* - You're here! I can help with code, debug, explain concepts
                - *ChatGPT* - Similar capabilities, different model
                - *GitHub Copilot Chat* - Web interface if you have GitHub account

                *2. IDE Extensions (Most Popular)*
                - *Cursor* - Full VS Code replacement with AI built-in. Download from cursor.com, works out of the box
                - *GitHub Copilot* - Install as VS Code/JetBrains extension ($10/month), autocompletes as you type
                - *Continue* - Free, open-source VS Code extension, lets you use multiple models

                *3. Command Line*
                - *Claude Code* - Anthropic's terminal tool for autonomous coding tasks. Install via `npm install -g @anthropic-ai/claude-code`
                - *Aider* - Open-source CLI tool that edits files directly

                ## What They Do

                - *Autocomplete tools* (Copilot, Cursor) - Suggest code as you type, finish functions
                - *Chat tools* (Claude, ChatGPT) - Explain, debug, design systems, write full programs
                - *Autonomous tools* (Claude Code, Aider) - Actually edit your files, make changes across codebases

                ## My Recommendation to Start

                1. Try *Cursor* first - download it, paste in some code, and ask it questions. It's the most beginner-friendly
                2. Or just start here in Claude - paste your code and I can help debug, explain, or write new features
                3. Once comfortable, try GitHub Copilot for in-line suggestions while coding

                The key is just picking one and trying it - you don't need to understand everything upfront!

              • UncleEntity 16 hours ago
                Just play with the 'free tier' on whatever website does the AI thing and figure it out.

                Maybe there's a need to try ten different ones but I just stuck with one and can now convince it to do what I want it to do pretty successfully.

            • UncleEntity 17 hours ago
              Ya know, in the over half a century I've been on this planet, choosing a new pair of shoes is so low on my 'life's little annoyances' list that it doesn't even rise above the noise of all the stupid random things which actually do annoy me.

              Maybe the problem is I don't take shoes seriously enough? Something to work on...

              • evilduck 2 hours ago
                You also learned about your shoe needs over the course of a lifetime. A caregiver gave you your first pair and you were expected to toddle around at most with them. You outgrew and replaced shoes as a child, were placed into new scenarios requiring different footwear as you grew up, learning and forming opinions about what's appropriate functionally, socially, economically as you went. You learned what stores were good for your needs, what brands were reputable, what styles and fits appealed to you. It took you more than a decade at minimum to achieve that.

                If you allow yourself to be a novice and a learner with AI and LLMs and don't expect to start out as a "shoe expert" where you never even think about this in your life and it's not even an annoyance, you'll find that it's the exact same journey.

              • AlecSchueler 9 hours ago
                And in all the years that LLMs have been available I've yet to find a subscription plan confusing.
          • AlecSchueler 22 hours ago
            Is it though? People complain about sore feet and hear they wear the wrong kind of shoes so they go to the store where they have to spend money to find out while trying to navigate between dress shoes, minimal shoes, running shoes, hiking shoes etc etc., they have to know their size, ask for assistance in trying them on...
        • kelnos 20 hours ago
          Because the offerings are not simple. Your Nike example is silly; everyone knows what to do with shoes and shorts and shirts, and why they might want (or not want) to buy those particular items from Nike.

          But for someone who hasn't been immersed in the "LLM scene", it's hard to understand why you might want to use one particular model of another. It's hard to understand why you might want to do per-request API pricing vs. a bucketed usage plan. This is a new technology, and the landscape is changing weekly.

          I think maybe it might be nice if folks around here were a bit more charitable and empathetic about this stuff. There's no reason to get all gatekeep-y about this kind of knowledge, and complaining about these questions just sounds condescending and doesn't do anyone any good.

        • tomrod 22 hours ago
          > Why is this the top comment on so many threads about tech products?

          Because you overestimate how much the representative person understands the differences.

          A more accurate analogy is that Nike sells green-blue shoes and Nike sells blue-green shoes, but the blue-green shoes add 3 feet to your jump and green-blue shoes add 20 mph to your 100 yard dash sprint.

          You know you need one of them for tomorrow's hurdles race but have no idea which is meaningful for your need.

          • ryandrake 22 hours ago
            Also, the green-blue shoes charge per-step, but the blue-green shoes are billed monthly by signing up for BlueGreenPro+ or BlueGreenMax+, each with a hidden step limit but BlueGreenMax+ is the one that gives you access to the Cyan step model which is better; plus the green-blue shoes are only useful when sprinting, but the blue-green shoes can be used in many different events, but only through the Nike blue-green API that only some track&field venues have adopted...
        • gmueckl 22 hours ago
          When you walk into a store, you can see and touch all of these products. It's intuitive.

          With all this LLM cruft all you get is essentially the same old chat interface that's like the year 2000 called and wants its on-line chat websites back. The only thing other than a text box that you usually get is a model selector dropdown squirreled away in a corner somewhere. And that dropdown doesn't really explain the differences between the cryptic sounding options (GPT-something, Claude Whatever...). Of course this confuses people!

          • derefr 22 hours ago
            Claude.ai, ChatGPT, etc. are finished B2C products. They're black boxes, encapsulated experiences. Consumers don't want to pick a model, or know what model they're using; they just want to "talk to AI", and for the system to know which model is best to answer any given question. I would bet that for these companies, if their frontend observes you using the little model override button, that gets instrumented as an "oops" event in their metrics — something they aim to minimize.

            What you're looking for, are the landing pages of the B2B API products underlying these B2C experiences. That would be https://www.anthropic.com/claude, https://openai.com/api/, etc. (In general, search "[AI company] API".)

            From those B2B landing pages, you can usually click through to pages with details about each of their models.

            Here's the model page corresponding to this news announcement, for example: https://www.anthropic.com/claude/opus

            (Also, note how these B2B pages are on the AI companies' own corporate domains; whereas their B2C products have their own dedicated domains. From their perspective, their B2C offerings are essentially treated as separate companies that happen to consume their APIs — a "reference use-case" — rather than as a part of what the B2B company sells.)

        • ryandrake 22 hours ago
          Hey, I'm open to the idea that I'm just stupid. But, if people in your target market (software developers) don't even understand your product line and need a HOWTO+glossary to figure it out, maybe there's also a branding/messaging/onboarding problem?
          • DougBTX 20 hours ago
            My hot take is that your friend should show you what they’re using, not just dismiss Copilot and leave you hanging!
        • potatolicious 22 hours ago
          Eh, this seems like a take that reeks a bit of "everyone is stupid except me".

          I do know the answer to OP's question but that's because I pickle my brain in this stuff. It is legitimately confusing.

          The analogy to different SKUs also strikes me as inaccurate. This isn't the difference between shoes, shirts, and shorts - it's more as if a company sells three t-shirts but you can't really tell what's different about them.

          It's Claude, Claude, and Claude. Which ones code for you? Well, actually, all of them (Code, web/desktop Claude, and the API can all do this)

          Which ones do you ask about daily sundry queries? Well, two of them (web/desktop Claude, but also the API, but not Code). Well, except if your sundry query is about a programming topic, in which case Code can also do that!

          Ok, if I do want to use this to write code, which one should I use? Honestly, any of them, and the company does a poor job of explaining why you would use each option.

          "Which of these very similar-seeming t-shirts should I get?" "You knob. How are posts like this even being posted." is just an extremely poor way to approach other people, IMO.

          • ryandrake 22 hours ago
            > It's Claude, Claude, and Claude. Which ones code for you?

            Thanks for articulating the confusion better than I could! I feel it's a similar branding problem as other tech companies have: I'm watching Apple TV+ on my Apple TV software running on my Apple TV connected to my Google TV that isn't actually manufactured by Google. But that Google TV also has an Apple TV app that can play Apple TV+.

            • potatolicious 21 hours ago
              It's a bit worse than a branding problem honestly, since there's legitimate overlap between products, because ultimately they're different expressions of the same underlying LLMs.

              I'm not sure if you ever got a good rundown, but the tl;dr is that the 3 products ("Desktop", Code, and API) all expose the same underlying models, but are given different prompts, tools, and context management techniques that make them behave fairly differently and affect how you interact with them.

              - The API is the bare model itself. It has some coding ability because that's inherent to the model - you can ask it to generate code and copy and paste it, for example. You normally wouldn't use this except when you're using some Copilot-type IDE integration where the IDE does the work of talking to the model for you and integrating it into your developer experience. In that case you provide an API key and the IDE does the heavy lifting. (There's a minimal sketch of what a raw API call looks like at the bottom of this comment.)

              - The desktop app is actually a half-decent coder. It's capable of producing specific artifacts, distinguishing between multiple "files" it's writing for you, and revisiting previously-written code. "Oh, actually rewrite this in Go" is, for example, a thing it can totally do. I find it useful for diagnosing issues interactively.

              - "Claude Code" is a CLI-only wrapper around the model. Think of it like Anthropic's first-party IDE integration, except there's not an IDE, just the CLI. In this case the integration gives the tool broad powers to actually navigate your filesystem, read specific files, write to specific files, run shell commands like builds and tests, etc. These are all functions that an IDE integration would also give you, but this is done in a Claude-y way.

              My personal take is: try Claude Code, since as long as you're halfway comfortable with a CLI it's pretty usable. If you really want a direct IDE integration you can go with the IDE+API key route, though keep in mind that you might end up paying more (Claude Code is all-you-can-eat-with-rate-limits, whereas API keys will... just keep going).
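              If the "bare model" API route sounds abstract, here's roughly what it looks like from Python, as a minimal sketch (the exact model ID is illustrative; check Anthropic's model docs for the current Opus identifier):

                  # pip install anthropic
                  import anthropic

                  # The client reads ANTHROPIC_API_KEY from the environment by default.
                  client = anthropic.Anthropic()

                  response = client.messages.create(
                      model="claude-opus-4-1",  # illustrative; use the current model ID from the docs
                      max_tokens=1024,
                      messages=[
                          {"role": "user", "content": "Write a Python function that deals a deck round-robin to N players."},
                      ],
                  )
                  print(response.content[0].text)

              Everything the IDE integrations and Claude Code do (reading files, applying edits, running tests) is layered on top of calls like that.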

              • ryandrake 21 hours ago
                Wow. After 50 replies to what I thought wasn't such a weird question, your rundown is the most enlightening. Thank you very much.
                • Karrot_Kream 21 hours ago
                  FWIW it's probably because a lot of us have been following along and trying these things from the start, so the nuances seem more obvious. But I also feel that some folks think your question is a bit "stupid", like "why are you suddenly interested in the frontier of these models? Where were you for the last 2 years?"

                  And to some extent it is like the PC race. Imagine going to work and writing software for whatever devices your company writes software for, in whatever toolchain your company uses. Then, 2-3 years after the PC race began heating up, asking "Hey I only really write code for whatever devices my employer gives me access to. Now I want to buy one of these new PCs but I don't really understand why I'd choose an Intel over a Motorola chipset, or why I'd prioritize more ROM or more RAM, and I keep hearing about this thing called RISC that's way better than CISC and some of these chips claim to have different addressing modes that are better?"

              • Karrot_Kream 21 hours ago
                Also when it comes to API integrations, I find some better than others. Copilot has been pretty crummy for me but Zed's Agent Mode seems to be almost as good as Claude Code. I agree with the general take that Claude Code is a good default place to start.
              • slackpad 21 hours ago
                Claude Code running in a terminal can connect to your IDE so you can review its proposed changes there. I’ve found this to be a nice drop in way to try it out without having to change your core workflow and tools too much. Check out the /ide command for details.
        • margalabargala 22 hours ago
        If anything, Anthropic has the product lineup that makes the most sense. Higher version numbers mean a better model. Haiku < Sonnet < Opus, which maps to the length/size of the poem forms. Free < Pro < Max.

        Contrast that with something like OpenAI. They've got gpt4.1, 4o, and o4. Which of these is newer than the others? How do people remember which of o4 and 4o is which?

        • squeaky-clean 21 hours ago
          Which Nike shoe is best for basketball? The Nike Dunk, Air Force 1, Air Jordan, LeBron 20, LeBron XXI Prime 93, Kobe IX elite, Giannis Freak 7, GT Cut, GT Cut 3, GT Cut 3 Turbo, GT Hustle 3, or the KD18?

          At least with those you can buy whatever you think is coolest. Which Claude model and interface should the average programmer use?

          • AlecSchueler 21 hours ago
            What's the average programmer? Is it someone who likes CLI tools? Or who likes IDE integration? Different strokes for different folks and surely the average programmer understands what environment they will be most comfortable in.
            • squeaky-clean 4 hours ago
              The environment isn't the only difference, it's not "do you prefer CLI or IDE or Web" because they behave differently. Claude Code and Claude web and Claude through Cursor won't give you identical outputs for the same question.

            It's not like running a tool in your IDE or CLI where the only difference is the interface. It would be as if gcc run from your IDE had faster compile times, but gcc run from the CLI gave better optimizations.

            The fact that no one is recommending any baseline to start with proves the point that it's confusing. And we haven't even touched on Sonnet vs. Opus.

            • nawgz 20 hours ago
              > Different strokes for different folks and surely the average programmer understands what environment they will be most comfortable in.

              That's a silly claim to me, we're talking about a completely new environment where you prompt an AI to develop code, and therefore an "average programmer" is unlikely to have any meaningful experience or intuition with this flow. That is exactly what GP is talking about - where does he plug in the AI? What tradeoffs are there to different options?

            The other day I had someone judge me for asking this question by dismissively saying "don't say you've still been using ChatGPT and copy/paste", which made me laugh - I don't use AI at all, so who was he looking down on?

              • AlecSchueler 9 hours ago
                To me that's the silly argument. How many different tools have you ever used? New build system? New linter? How did you know if you wanted to run those on the command line or in your IDE?

                And it seems the story you shared sort of proves the point: the web interface worked fine for you and you didn't need to question it until someone was needlessly rude about it.

        • pdntspa 20 hours ago
          Because few seem to want to expend the effort to dive in and understand something. Instead they want the details spoonfed to them by marketing or something.

          I absolutely loathe this timeline we're stuck in.

        • true_religion 21 hours ago
          This is like being told to buy Nike shoes. Then when you proudly display your new cleats, they tell you "no, I meant you should buy basketball shoes. The cleats are terrible."
        • Imustaskforhelp 22 hours ago
          Because I think Claude has gone beyond the tech niche at this point.

          Or maybe that's just me, but whether it's used directly or through vibe coding apps like Lovable, Bolt, etc.,

          at the end of the day most people are using the same tool, which is Claude, since it's mostly superior at coding (questionable now with the OSS models, but I still use it through Kiro).

          People expect this stuff to be simple when in reality it's not, and there's some frustration, I suppose.

        • hvb2 22 hours ago
          Not sure if this is sarcasm; I'm assuming not.

          You're comparing well understood products that are wildly different to products with code names. Even someone who has never worn a t-shirt will see it on a mannequin and know where it goes.

          I'm sorry but I cannot tell what the difference is between sonnet and opus. Unless one is for music...

          So in this case you read the docs. Which is, in your analogy, you going to the Nike store and reading up on whether a t-shirt goes on your upper or lower body.

          • AlecSchueler 9 hours ago
            Surely anyone interested in taking out a Claude subscription knows broadly what they're going to use an LLM for.

            It's more like going to the Nike store and asking about the difference between the Vaporfly 3 and the Pegasus 41. I know they're all shoes and therefore go on my feet, but I don't know what the difference is unless one is better for riding horses?

      • windsignaling 21 hours ago
        On the contrary, I'm confused about why you're confused.

        This is a well-known and documented phenomenon - the paradox of choice.

        I've been working in machine learning and AI for nearly 20 years and the number of options out there is overwhelming.

        I've found many of the tools out there do some things I want, but not others, so even finding the model or platform that does exactly what I want or does it the best is a time-consuming process.

    • Filligree 23 hours ago
      You need Claude Pro or Max. The website subscription also allows you to use the command line tool—the rate limits are shared—and the command line tool includes IDE integration, at least for VSCode.

      Claude Code is currently best-in-class, so no point in starting elsewhere, but you do need to read the documentation.

      • 47282847 18 hours ago
        > You need Claude Pro or Max.

        Actually, to try it out, prepaid token billing is fine. You are not required to have a subscription for claude code cli. Even just $5 gave me enough breathing room to get a feeling for its potential, personally. I do not touch code often these days so I was relieved not to have to subscribe and cancel again just to play around a little and have it write some basic scripts for me.

      • wahnfrieden 21 hours ago
        Correct. Claude Code Max with Opus. Don’t even bother with Sonnet.
        • kelnos 20 hours ago
          I wouldn't be too prescriptive. I have Pro, and it's fine. I'm not an incredibly heavy user (yet?); I've hit the rate limits a couple times, but not to the point where I'm motivated to spend more.

          I haven't tried it myself, but I've heard from people that Opus can be slow when using it for coding tasks. I've only been using Sonnet, and it's performed well enough for my purposes.

          • Filligree 18 hours ago
            Sonnet works fine in many cases. Opus is smarter, and custom 'agents' can be set to use either.

            I prefer configuring it to use Sonnet for things that don't require much reasoning/intelligence, with Opus as the coordinator.

          • wahnfrieden 17 hours ago
            Opus is slow, so sessions should be used in parallel, likely across work trees. You shouldn't sit and wait on an Opus agent.
    • andsoitis 22 hours ago
      > use Claude. But I have no idea what the right way to do it is because there are so many paths to choose.

      Anthropic has this useful quick start guide: https://docs.anthropic.com/en/docs/claude-code/quickstart

    • prinny_ 23 hours ago
      What exactly did you try with GitHub Copilot? It's not an LLM itself, just an interface for an LLM. I have Copilot in my professional GitHub account and I can choose between ChatGPT and Claude.
    • vlade11115 23 hours ago
      Claude Code has two usage modes: pay-per-token or subscription. Both modes use the API under the hood, but with subscription mode you only pay a fixed amount a month. Each subscription tier has some undisclosed limits; cheaper plans have lower usage limits. So I would recommend paying $20 and trying Claude Code via that subscription.
      • kace91 22 hours ago
        I'm looking for Cursor alternatives after the confusing pricing changes. Is Claude Code an option? Can it be integrated into an editor/IDE for similar results?

        My use case so far is usually requesting mechanical work I would rather describe than write myself, like certain test suites, and sometimes discovery on messy code bases.

        • andyferris 19 hours ago
          Claude Code is really good for this situation.

          If you like an IDE, for example VS Code, you can have the terminal open at the bottom and run Claude Code in that. You can put your instructions there and any edits it makes are visible in the IDE immediately.

          Personally I just keep a separate terminal open and have the terminal and VSCode open on two monitors - seems to work OK for me.

      • dennisy 23 hours ago
        No Opus in the $20 tier though sadly
        • andyferris 19 hours ago
          As far as I can tell - that seems to have changed today!
          • andyferris 14 hours ago
            Actually I think I was wrong, the PR material was just vague about it.
        • oblio 23 hours ago
          What does Opus do extra?
          • lxgr 22 hours ago
            It's a much larger, more capable LLM than Claude Sonnet.
            • oblio 10 hours ago
              I mean day to day. How is the coding experience different?
    • lojack 15 hours ago
      Let's see: We have GitHub, and GitHub Enterprise Server, and a GitHub API. Then there's the command line and a desktop version, and one that is just browser based I guess. Then you have different pricing plans, Free, Team, and Enterprise? How is Enterprise different from GitHub Enterprise Server? It's very easy to find evidence to confirm our bias.

      Claude code is actually one of the most straightforward products I've used as far as onboarding goes. You download the tool, and follow the instructions. You can use one of the 3 plans, and everything else is automatic. You can figure out token usage and what models and versions to use and how to use MCP servers and all of that -- there's a lot of power -- but you don't need to do ANY of that to get started trying it out.

      You're not being:

      > That critic who doesn't try the stuff he criticizes

      You're being:

      > That critic who is trying to confirm their biases

    • joshmarlow 22 hours ago
      VSCode has a pretty good Gemini integration - it can pull up a chat window from the side. I like to discuss design changes and small refactorings ("I added this new rpc call in my protobuf file, can you go ahead and stub out the parts of code I need to get this working in these 5 different places?") and it usually does a pretty darn good job of looking at surrounding idioms in each place and doing what I want. But Gemini can be kind of slow here.

      But I would recommend just starting with Claude in the browser: talk through an idea for a project you have and ask it to build it for you. Go ahead and have a brainstorming session before you actually ask it to code - it'll help make sure the model has all of the context. Don't be afraid to overload it with requirements - it's generally pretty good at putting together a coherent plan. If the project is small/fits in a single file - say a one page web app or a complicated data schema + SQL queries - then it can usually do a pretty good job in one place. Then just copy+paste the code and run it out of the browser.

      This workflow works well for exploring and understanding new topics and technologies.

      Cursor is nice because it's an AI integrated IDE (smoother than the VSCode experience above) where you can select which models to use. IMO it seems better at tracking project context than Gemini+VSCode.

      Hope this helps!

    • collinvandyck76 23 hours ago
      Claude Code is the superior interface in my opinion. Definitely start there.
    • screye 20 hours ago
      Cursor + Claude 4 = best quality + UX balance. Pay up for the $20/month subscription.

      Cursor imports your VSCode setup. Setting it up should be trivial.

      Use Agent mode. Use it in a preexisting repo.

      You're off to the races.

      There is a lot more you can do, but you should start seeing value at this point.

    • kelnos 21 hours ago
      If you're looking for a coding assistant, get Claude Code, and give it a try. I think you need the Pro plan at a minimum for that ($20/mo; I don't think Free includes Claude Code). Don't do the per-request API pricing as it can get expensive even while just playing around.

      Agree that the offering is a bit confusing and it's hard to know where to start.

      Just FYI: Claude Code is a terminal-based app. You run it in the working directory of your project, and use your regular editor that you're used to, but of course that means there's no editor integration (unlike something like Cursor). I personally like it that way, but YMMV.

    • wintermutestwin 22 hours ago
      Yes. You basically need an LLM to provide guidance on product selection in this brave new world.

      It is actually one of my most useful use cases of this tech. Nice to have a way to ask in private so you don’t get snarky answers like: it’s just like buying shoes!

    • spaceman_2020 22 hours ago
      Download Claude Code

      Create a new directory in your terminal

      Open that directory and type in "claude" to run Claude

      Press Shift + Tab to go into planning mode

      Tell Claude what you want to build - recommend something simple to start with. Specify the languages, environment, frameworks you want, etc.

      Claude will come up with a plan. Modify the plan or break it into smaller chunks if necessary

      Once the plan is approved, ask it to start coding. It will ask you for permissions and give you the finished code

      It really is something when you actually watch it go.

    • olalonde 23 hours ago
      Claude Code CLI.
      • ryandrake 23 hours ago
        Thanks. With the CLI, can you get Copilot-ish things like tab-completion and inline commands directly in your IDE? Or do you need to copy/paste to and from a terminal? It feels like running a command on the IDE and then copying the output into your IDE is a pretty primitive way to operate.
        • avemg 23 hours ago
          My advice is this:

          1) Completely separate in your mind the auto-completion features from the agentic coding features. The auto-completion features are a neat trick but I personally find those to be a bit annoying overall, even if they sometimes hit it completely right. If I'm writing the code, I mostly don't want the LLM autocompletion.

          2) Pay the $20 to get a month of Claude Pro access and then install Claude Code. Then, either wait until you have a small task in mind or you're stuck on some stupid issue you've been banging your head on, and then open your terminal and fire up Claude Code. Explain to it in plain English what you want it to do. Pretend it's a colleague that you're giving a task to over Slack. And then watch it go. It works directly on your source code. There is no copying and pasting code.

          3) Bookmark the Claude website. The next time you'd Google something technical, ask Claude instead. General questions like "how does one typically implement a flizzle using the floppity-do framework?" or "I'm trying to accomplish X, what are my options when using this stack?". General questions like that.

          From there you'll start to get it and you'll get better at leveraging the tool to do what you want. Then you can branch out to the rest of the tool ecosystem.

          • ryandrake 22 hours ago
            Interesting about the auto-completion. That was really the only Copilot feature I found to be useful. The idea of writing out an English prompt and telling Copilot what to write sounded (and still sounds) so slow and clunky. By the time I've articulated what I want it to do, I might as well have written the code myself. The auto-completion was at least a major time-saver.

            "The card game state is a structure that contains a Deck of cards, represented by a list of type Card, and a list of Players, each containing a Hand which is also a list of type Card, dealt randomly, round-robin from the Deck object." I could have input the data structure and logic myself in the amount of time it took to describe that.

            • avemg 22 hours ago
              I think you should embrace a bit of ambiguity. Don't treat this like a stupid computer where you have to specify everything in minute detail. Certainly the more detail you give, the better to an extent. But really: Treat it like you're talking to a colleague and give it a shot. You don't have to get it right on the first prompt. You see what it did and you give it further instructions. Autocomplete is the least compelling feature of all of this.

              Also, I don't remember what model Copilot uses by default, especially the free version, but the model absolutely makes a difference. That's why I say to spend the $20. That gives you access to Sonnet 4 which is where, imo, these models took a giant leap forward in terms of quality of output.

              • rstupek 20 hours ago
                Is Opus as big a leap as sonnet4 was?
              • ryandrake 22 hours ago
                Thanks, I shall give it a try.
            • stillpointlab 22 hours ago
              One analogy I have been thinking about lately is GPUs. You might say "The amount of time it takes me to fill memory with the data I want, copy from RAM to the GPU, let the GPU do its thing, then copy it back to RAM, I might as well have just done the task on the CPU!"

              I hope when I state it that way you start to realize the error in your thinking process. You don't send trivial tasks to the GPU because the overhead is too high.

              You have to experiment and gain experience with agent coding. Just imagine that there are tasks where the overhead of explaining what to do and reviewing the output are dwarfed by the actual implementation. You have to calibrate yourself so you can recognize those tasks and offload them to the agent.

            • potatolicious 22 hours ago
              There's a sweet spot in terms of generalization. Yes, painstakingly writing out an object definition in English just so that the LLM can write it out in Java is a poor use of time. You want to give it more general tasks.

              But not too general, because then it can get lost in the sauce and do something profoundly wrong.

              IMO it's worth the effort to know these tools, because once you have a more intuitive sense for the right level of abstraction it really does help.

              So not "make this very basic data structure for me based on my specs", and more like "rewrite this sequential logic into parallel batches", which might take some actual effort but also doesn't require the model to make too many decisions by itself.

              It's also pretty good at tests, which tends to be very boilerplate-y, and by default that means you skip some cases, do a lot of brain-melting typing, or copy-and-paste liberally (and suffer the consequences when you missed that one search and replace). The model doesn't tire, and it's a simple enough task that the reliability is high. "Generate test cases for this object, making sure to cover edges cases A, B, and C" is a pretty good ROI in terms of your-time-spent vs. results.

          • Pxtl 12 hours ago
            Is there any more agent-oriented approach where it just push/pulls a git repo like a normal person would, instead of running it on my machine? I'd like to keep it a bit more isolated and having it push/pull its own branches seems tidier.
        • cultureulterior 23 hours ago
          Claude does the coding, and edits your files. You just sit back and relax. You don't do any tab completion etc.
    • robluxus 20 hours ago
      > I just want to putz around with something in VSCode for a few hours!

      I just googled "using claude from vscode" and the first page had a link that brought me to anthropic's step by step guide on how to set this up exactly.

      Why care about pricing and product names and UI until it's a problem?

      > Someone on HN told me Copilot sucks, use Claude.

      I concur, but I'm also just a dude saying some stuff on HN :)

    • ActorNightly 22 hours ago
      If you want your own cheap IDE integration, you can set up VSCode with the Continue extension, Ollama running locally, and a small agent model. https://docs.continue.dev/features/agent/model-setup.

      If you want to understand how all of this works, the best way is to build a coding agent manually. It's not that hard:

      1. Start with Ollama running locally and Gemma3 QAT models. https://ollama.com/library/gemma3

      2. Write a wrapper around Ollama using your favorite language. The idea is that you want to be able to intercept responses coming back from the model.

      3. Create a system prompt that tells the model things like "if the user is asking you to create a file, reply in this format:...". Generally to start, you can specify instructions for read file, write file, and execute file

      4. In your wrapper, when you send the input chat prompt, and get the model response back, you look for those formats, and make the wrapper actually execute the action. For example if the model replies back with the format to read file, you read the file from your wrapper code and send it back to the model.

      Every coding assistant is basically this under the hood, just with a lot more fluff and their own IDE integration. (A minimal sketch of this loop is at the end of this comment.)

      The benefit of doing your own is that you can customize it to your own needs, and when you direct a model with more precision even the small models perform very well with much faster speed.
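
      Here's a minimal sketch of that loop in Python against Ollama's local REST API. The READ_FILE/WRITE_FILE action format is an assumption made up for illustration, not anything the real tools use:

          # Toy coding agent: send a chat turn to a local Ollama model and
          # intercept simple file actions in its reply (steps 3 and 4 above).
          import re
          import requests

          OLLAMA_URL = "http://localhost:11434/api/chat"
          SYSTEM_PROMPT = (
              "You are a coding assistant. To read a file, reply with 'READ_FILE: <path>' "
              "on its own line. To write a file, reply with 'WRITE_FILE: <path>' followed "
              "by the file contents in a fenced code block. Otherwise answer normally."
          )

          def chat(messages):
              r = requests.post(OLLAMA_URL, json={"model": "gemma3", "messages": messages, "stream": False})
              r.raise_for_status()
              return r.json()["message"]["content"]

          messages = [{"role": "system", "content": SYSTEM_PROMPT}]
          while True:
              messages.append({"role": "user", "content": input("you> ")})
              reply = chat(messages)
              messages.append({"role": "assistant", "content": reply})

              read = re.search(r"READ_FILE:\s*(\S+)", reply)
              write = re.search(r"WRITE_FILE:\s*(\S+)\s*```\w*\n(.*?)```", reply, re.S)
              if read:
                  # Feed the file contents back to the model and print its follow-up.
                  contents = open(read.group(1)).read()
                  messages.append({"role": "user", "content": f"Contents of {read.group(1)}:\n{contents}"})
                  followup = chat(messages)
                  messages.append({"role": "assistant", "content": followup})
                  print("agent>", followup)
              elif write:
                  open(write.group(1), "w").write(write.group(2))
                  print(f"agent> wrote {write.group(1)}")
              else:
                  print("agent>", reply)

      It's obviously toy-grade (no error handling, no shell execution, no diffing), but it's the same shape as what the real tools do.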

      • afro88 21 hours ago
        OP is asking where to get started with Claude for coding. They're confused. They just want to mess around with it in VSCode. And you start talking about Ollama, QAT, coding your own wrapper, composing a system prompt, etc.!?
        • ActorNightly 1 hour ago
          OP is trying to get LLMs to assist with coding, which implies that coding is something they're capable of, and coding your own wrapper is a great way to get familiar with these systems.
    • adamors 23 hours ago
      Download Cursor and try it through that; IMO that's currently the most polished experience, especially since you can change models on the fly. For more advanced use cases the CLI is better, but for getting your feet wet I think Cursor is the best choice.
      • ryandrake 23 hours ago
        Thanks. Too bad you need to switch editors to go that path. I assume the Cursor monthly plans are not the same as the Claude monthly plans and you can't use one for the other if you want to experiment...
    • StephenHerlihyy 22 hours ago
      Kilo Code for VSCode is pretty solid. Give it a try.
    • jimbo808 21 hours ago
      You just described all of your options in detail - what's the problem? Pick one. Seems like you've got a very thorough grasp on how to get started trying the stuff out, but it requires you to choose how you want to do that.
    • zarzavat 22 hours ago
      Github Copilot and Claude code are not exactly competitors.

      Github Copilot is autocomplete, highly useful if you use VS Code, but if you are using e.g. Jetbrains then you have other options. Copilot comes with a bunch of other stuff that I rarely use.

      Claude code is project-wide editing, from the CLI.

      They complement each other well.

      As far as I'm concerned the utility of the AI-focused editors has been diminished by the existence of Claude code, though not entirely made redundant.

      • qingcharles 21 hours ago
        This isn't correct. GitHub Copilot now totally competes with Claude Code. You can have it create an entire app for you in "Agent" mode if you're feeling brave. In fact, seeing as Copilot is built directly into Visual Studio when you download it, I guess they have a one-up.

        Copilot isn't locked to a specific LLM, though. You can select the model from a panel, but I don't think you can plug in your own right now, and the ones you can select might not be SOTA because of that.

        • zarzavat 9 hours ago
          I didn't mean it doesn't attempt to compete, I mean it doesn't actually compete. Claude code for agents, Copilot for autocomplete (depending on your editor/IDE).

          For single-line autocomplete, which is how I use it, pretty much anything will do the job. I use Copilot only because it integrates well with VS Code. I find the other features to be inferior.

        • alienbaby 19 hours ago
          Sonnet 4 in Copilot agent mode has been doing great work for me lately. Especially once you realise that at least 50% of the work is done before you get to Copilot, as architectural and product specs and implementation plans.
        • esafak 13 hours ago
          Is Copilot's Agent Mode any good, though?
          • qingcharles 12 hours ago
            Ehhh... I wouldn't use it for anything important right now. It often screws up by truncating code files then asking itself "where did all those functions go?" and having to rewrite them from scratch.

            When it works, it's great though. I've used it to vibe-code some nice little desktop apps to automate things I needed and it produced way more polished UI than I would have spent the time doing, and the code is pretty much how I would have written it myself. I just set it going and go do some other task for 10 mins and come back to see what changes it made.

      • tomwojcik 20 hours ago
        Opencode https://github.com/sst/opencode provides a CC like interface for copilot. It's a slightly worse tool, but since copilot with Claude 4 is super cheap, I ended up preferring it over CC. Almost no limits, cheaper, you can use all the Copilot models, GH is not training on your data.
      • fkyoureadthedoc 22 hours ago
        > Github Copilot is autocomplete... comes with a bunch of other stuff that I rarely use.

        That bunch of other stuff includes the chat, and more recently "Agent Mode". I find it pretty useful, and the autocomplete near useless.

    • w0m 20 hours ago
      Honestly: Copilot's free mode. Just playing with the agentic stuff can give you a good idea; attach it to Roo and you'll get an even better one. Realize that if you paid for a better model you'd get better results, since the free tier doesn't include many premium tokens.
    • zaphirplane 20 hours ago
      try asking it ?
    • fuckyah 22 hours ago
      [dead]
    • vanillax 22 hours ago
      All the tools (Copilot, Claude, Gemini) in VS Code are completely worthless unless they're in Agent Mode. I have no idea why none of them defaults to Agent Mode.
  • sbene970 3 hours ago
    Opus 4.1 is now set as default model in Claude Code - just a heads-up.
  • mrcwinn 23 hours ago
    o3 and o3-pro are just so good. Sonnet goes off the deep end too often and Opus, in my experience, is not as strong at reasoning compared to OpenAI, despite the higher costs. Rarely do we see a worse, more expensive product win - but competition is good and I’m rooting for Anthropic nonetheless!
    • bayesianbot 16 hours ago
      OpenAI also has Flex processing[1] for o3. I've spent most of my time with Gemini 2.5, but lately I've been trying out a ton of o3, since it seems to work quite well and I get really cheap tokens (~95% of my agentic tokens are cached, which is a 75% discount, and flex mode takes another 50% off, working out to about $0.25 / million input tokens).

      [1] https://platform.openai.com/docs/guides/flex-processing?api-...

      • esafak 13 hours ago
        Which agents support flex mode?
        • bayesianbot 13 hours ago
          I've made my own fork of Codex that always uses flex, or you can route agents through litellm and make it add the service_tier parameter. I haven't really seen native support for it anywhere.
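
          For anyone curious, the parameter is just an extra field on a normal chat call. A minimal sketch with the OpenAI Python SDK (assuming your account has Flex access for o3):

              # pip install openai
              from openai import OpenAI

              client = OpenAI()  # reads OPENAI_API_KEY from the environment

              resp = client.chat.completions.create(
                  model="o3",
                  service_tier="flex",  # Flex processing: cheaper tokens, potentially slower responses
                  messages=[{"role": "user", "content": "Summarize the trade-offs of Flex processing."}],
              )
              print(resp.choices[0].message.content)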
    • WXLCKNO 22 hours ago
      o3 feels pretty good to me as well, but o3-pro has consistently one-shotted problems other LLMs got stuck on.

      I'm talking multiple tries with Claude 4 Opus, Gemini 2.5 Pro, o3, etc., sometimes resulting in hundreds of lines of code.

      Versus o3-pro (very slowly) analyzing and then fixing something that seemed completely unrelated in a one- or two-line change, truly fixing the root cause.

      o3-pro-level LLMs at reduced cost and increased speed will be amazing.

    • AlecSchueler 23 hours ago
      Off the deep end?
      • derwiki 14 hours ago
        It picks a bad path forward and keeps doubling down on it
      • UncleEntity 16 hours ago
        Probably referring to its tendency to over-complicate things to the point you have to step in and be like "WTF are you even talking about... Wouldn't it be a lot simpler to just use the original, well planned out design?"

        Which it does a lot...

  • NitpickLawyer 1 day ago
    Cheekily announcing during oAI's oss model launch :D
  • paxys 1 day ago
    Why is everything releasing today?
    • highfrequency 23 hours ago
      If they release before GPT-5, they don't have to compare to GPT-5 in their benchmarks. It's a big PR win to be able to plausibly claim that your model is the best coding model at the time of release.
    • datameta 1 day ago
      Could it be that nobody wanted to be first and overshadowed, nor the only one left out, and it cascaded after the first announcement? My first hunch, though, was that it had been agreed upon. I think game theory tells us that releasing on the same day, rotating the order (ABC, BCA, CAB, etc.), would be the lowest-risk, highest-average-gain strategy?
  • taoh 7 hours ago
    Have been using it in Claude Code with Max Plan for one day. The rate of acceptance is noticeably higher.
  • P24L 22 hours ago
    The improved Opus isn’t about achieving significantly better peak performance for me. It’s not about pushing the high end of the spectrum. Instead, it’s about consistently delivering better average results - structuring outputs more effectively, self-correcting mistakes more reliably, and becoming a trustworthy workhorse for everyday tasks.
  • jasonlernerman 1 day ago
    Has anyone tested it yet? How's it acting?
    • jedisct1 1 day ago
      Tested it on a refactor of Zig code. It worked fine, but was very slow.
    • usaar333 1 day ago
      No obvious gains I feel from quick chats, but too early to tell.

      These benchmark gains aren't that high, so I doubt it is that obvious.

    • smerrill25 1 day ago
      waiting for this, too.
  • _vaporwave_ 23 hours ago
    It's interesting that Anthropic maintains current prices for prior state of the art models when doing a new release. Why offer a model with worse performance for the same price? What incentives are they trying to create?
    • gwd 20 hours ago
      > What incentives are they trying to create?

      One obvious explanation is that pricing is strongly related to their own costs, and that their only incentive is for people to use an expensive model only if they really need it.

      I forget which one of the GPT models was better, faster, and cheaper than the previous model. The incentive there is obviously, "If you want to use the old model for whatever reason, fine, but we really want you to use the new one because it costs us less to run."

    • dysoco 23 hours ago
      I'm guessing it's mostly for legacy reasons. When 3.7 came out many people were not happy with it and went back to 3.5; I guess supporting older models for a while makes sense.
  • alvis 21 hours ago
    Funny how OpenAI and Anthropic seem to be coordinating their releases on the same day
  • alrocar 1 day ago
    Just ran the LLM-to-SQL benchmark over opus-4.1 and it didn't top the previous version :thinking: => https://llm-benchmark.tinybird.live/
    • epolanski 1 day ago
      How does it perform when run multiple times?

      LLMs are non-deterministic; I think benchmarks should be averages of N runs rather than single-shot experiments.

  • mocmoc 22 hours ago
    Their limits are just… a real roadblock
    • bananapub 22 hours ago
      huh?

      Claude Max is tens of hours of Opus a month, or you can pay per token and have unlimited usage.

      Or did you mean “I wish it was cheaper”?

      • andyferris 18 hours ago
        Ha - the $200 plan should be renamed to "Claude Mad Max" :)
  • OldGreenYodaGPT 22 hours ago
    Claude Code has honestly made me at least 10x more productive. I’ve burned through about 3 billion tokens and have been consistently merging 5+ PRs a day, tackling tons of tech debt, improving GitHub Actions, and making crazy progress on product work
    • AstroBen 22 hours ago
      only 10x? I'm at least 100x as productive. I only type at a measly 100wpm, whereas Claude can output 100+ tokens a second

      I'm outputting a PR every 6 minutes. The reviewers are using Claude to review everything. It used to take a day to add 100 lines to the codebase.. now I can add 100 lines in one prompt

      If I want even more productivity (at risk of making the rest of my team look slow) I can tell Claude to output double the lines and ship it off for review. My performance metrics are incredible

      • samtp 22 hours ago
        So no human reads the actual code that you push to production? Are you not worried about security risks, spaghetti code, and other issues? Or does Claude magically make all of those concerns go away?
        • AstroBen 21 hours ago
          forgot the /s
          • samtp 21 hours ago
            Sorry lol, sometimes difficult to separate the hype boys from actual sarcasm these days
      • qingcharles 20 hours ago
        Not sure if joking...?
        • AstroBen 20 hours ago
          This is only the beginning. I can see myself having 100 Claude tasks running concurrently - the only problem is edits clash between files. I'm working on having Claude solve this by giving each instance its own repo to work with, then I ask the final Claude to mash it all together as best it can

          What's 100x productivity multiplied by 100 instances of Claude? 10,000x productivity

          Now to be fair and a bit more realistic it's not actually 10000x because it takes longer to push the PR because the file sizes are so big. Let's call it 9800x. That's still a sizable improvement

      • trallnag 19 hours ago
        Big if true
    • steinvakt2 22 hours ago
      I also have this feeling that I'm 2-10x more productive. But isn't it curious how a lot of devs feel this way, but no devs that I know have the experience that any of their colleagues have become 2-10x more productive?
      • mwigdahl 21 hours ago
        <raises hand> Our automated test folks were chronically behind, struggling to keep up with feature development. I got the two assigned to the team that was the most behind set up with Claude Code. Six weeks later they are fully caught up, expanding coverage, and integrating AI code review into our build pipeline.

        It's not 10x, but those guys do seem like they've hit somewhere around 2x improvement overall.

      • nevertoolate 22 hours ago
        10x means to me that i can finish a month of work in max 2 days and go cloud watching. What does it mean for you?
        • NitpickLawyer 12 hours ago
          Sometimes 10x can mean that I start things that I would have never started before, knowing it would take a long time. Or that I can have any of the agentic stuff "explore" libs, stacks and frameworks I wanted to look at, but had no time. Or distill some vague docs and blog posts to find common use cases for tech x. And so on.

          It's not always a literal 10x time for taskA w/ AI vs taskA w/o AI...

        • ares623 14 hours ago
          A 60 minute script becomes 6 minutes
    • samtp 22 hours ago
      What type of work do you do and what type of code do you produce?

      Because I've found it to work pretty amazingly for things that don't need to be exact (like data modeling) or don't have any security implications (public apps). But for everything else I end up having to find all the little bugs by reading the code line by line, which is much slower than just writing the code in the first place.

    • screye 19 hours ago
      How do you maintain high confidence in the code it generates ?

      My current bottleneck is having to review the huge amounts of code that these models spit out. I do TDD, use auto-linting and type-checking.... but the model makes insidious changes that are only visible on deep inspection.

      • sneak 11 hours ago
        You have to review your code for quality and bugs and errors now just as you did last month or last year. Did you never write bugs accidentally before?

        We're all bottlenecked on reviewing now. That's a good thing.

        • screye 3 hours ago
          When I wrote the code myself there was a greater awareness of exactly what I'd written. By definition, I would not have written those bugs in, as long as I had the known edge cases in mind.

          Lapses of judgement and syntax errors happen, but they're easier to spot because you know exactly what you're looking at. When code is written by a model, I have to review it 3 times.

          1st to understand the code. 2nd to identify lapses in suspicious areas. 3rd to confirm my suspicions through interactive tests, because the model can use patterns I'm unfamiliar with, and it takes me some googling to confirm if certain patterns used by the model are outright bugs or not. The biggest time sink is fixing an identified bug, because now you're doing it in someone-else's (model's) legacy code rather than a greenfield feature implementation.

          It's a big productivity bump. But, if reviewing is the bottleneck, then that upper bounds the productivity gains at ~4x for me. Still incredible technology, but not the death of software engineering that it is claimed to be.

    • theappsecguy 19 hours ago
      The only way you could be 10x more productive is if you were doing nothing before.
    • totaa 22 hours ago
      can you share your workflow?
  • hartator 20 hours ago
    > 1 min read

    What's the point of these?

    Kind of interesting that we live in an era of super-advanced AI but still make basic UI/UX mistakes. The tagline of this blog post shouldn't be "1 min read".

    It's not even accurate. I timed myself reading neither fast nor slow, and it took me 3 min 30 s. Maybe the images need to be OCRed to make the estimate more accurate.

  • system2 16 hours ago
    Claude lost me after I used it for a day. Their pricing model is bonkers. There is no way any developer in their right mind would go with Claude.
    • thirdacc 16 hours ago
      Their API pricing is bonkers, their subscription is a great deal for what you get
  • paul7986 21 hours ago
    Claude Plus failed me badly today compared to ChatGPT Plus.

    I uploaded a web design of mine (JPEG) and asked Claude to create the HTML/CSS. I asked GPT to do the same. GPT's code looked the closest to the design I created and uploaded. Just five to ten small tweaks and I was done, vs. Claude, where it would have taken me almost triple the steps.

    I actually subscribed to both today (resubscribed to GPT) and am going to keep testing which one is the better front-end developer (I am, but I've got to embrace AI).

  • ramesh31 23 hours ago
    Will the price for 4 go down? I still find Opus completely unusable for the cost/performance, as someone who spends thousands per month on tokens. There's really no noticeable difference from Sonnet, at nearly 10x the price.
  • camillomiller 5 hours ago
    Well wait another 24hrs…
  • sneak 11 hours ago
    Is it just me, or is Opus 4.1 substantially worse in Claude Code than Opus 4.0 was? I feel like I'm using Sonnet.

    It's making really stupid errors and I have to work three times as much to get the same results as last week.

  • jedisct1 1 day ago
    Is it just me or is it super slow?
  • cailai 10 hours ago
    [dead]
  • rvz 23 hours ago
    Notice how Anthropic has never open sourced any of their models.

    This makes them (Anthropic) worse than OpenAI in terms of openness.

    Since, in this case, as we all know: [0]

    "What will permanently change everything is open source and transparent AI models that are smaller and more powerful than GPT-3 or even GPT-4."

    [0] https://news.ycombinator.com/item?id=34865626

    • jjani 23 hours ago
      On the other hand, they have always exposed their raw chain of thought, so you know exactly what you're paying for, unlike OpenAI, who hides it. Similarly, they allow an actual thinking budget rather than a vague "low, medium, high", again unlike OpenAI. They also allow API access to all their models without draconian send-us-your-personal-data KYC, once more unlike OpenAI.

      They might not fit your personal definition of "openness", but they do fit many other equally valid interpretations of that concept.
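
      For what it's worth, the thinking budget mentioned above really is an explicit number in the API. A minimal sketch with Anthropic's Python SDK (model ID and budget values are illustrative):

          import anthropic

          client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

          response = client.messages.create(
              model="claude-opus-4-1",  # illustrative model ID
              max_tokens=4096,
              thinking={"type": "enabled", "budget_tokens": 2048},  # explicit thinking budget, not "low/medium/high"
              messages=[{"role": "user", "content": "Plan a migration from REST to gRPC for this service."}],
          )

          # The raw chain of thought comes back as 'thinking' blocks alongside the 'text' blocks.
          for block in response.content:
              print(block.type, getattr(block, "thinking", None) or getattr(block, "text", ""))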

  • KaoruAoiShiho 21 hours ago
    For me this is the big news of the day. Looks insane.