Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku

(anthropic.com)

1454 points | by weirdcat 253 days ago

132 comments

LASR 253 days ago
This is actually a huge deal.
As someone building AI SaaS products, I used to have the position that directly integrating with APIs is going to get us most of the way there in terms of complete AI automation.
I wanted to take at stab at this problem and started researching some daily busineses and how they use software.
My brother-in-law (who is a doctor) showed me the bespoke software they use in his practice. Running on Windows. Using MFC forms.
My accountant showed me Cantax - a very powerful software package they use to prepare tax returns in Canada. Also on Windows.
I started to realize that pretty much most of the real world runs on software that directly interfaces with people, without clearly defined public APIs you can integrate into. Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs etc.
Boy was I wrong.
I am glad they did this, since it is a powerful connector to these types of real-world business use cases that are super-hairy, and hence very worthwhile in automating.
[-]
- aduffy 253 days ago
  This has existed for a long time, it's called "RPA" or Robotic Process Automation. The biggest incumbent in this space is UiPath, but there are a host of startups and large companies alike that are tackling it.
  Most of the things that RPA is used for can be easily scripted, e.g. download a form from one website, open up Adobe. There are a lot of startups that are trying to build agentic versions of RPA, I'm glad to see Anthropic is investing in it now too.
  [-]
  - CSMastermind 253 days ago
    RPA has been a huge pain to work with.
    It's almost always a framework around existing tools like Selenium that you constantly have to fight against to get good results from. I was always left with the feeling that I could build something better myself just handrolling the scripts rather than using their frameworks.
    Getting Claude integrated into the space is going to be a game changer.
    [-]
    - visarga 253 days ago
      Most RPA work is in dealing with errors and exceptions, not the "happy path". I don't see how Claude's Screen Agent is going to work out there - what do you do when an error pops up and you need to implement specific business logic how to respond? How about consistency over many executions, and enterprise accounts. You want a centralized way to control agent behavior. Scripting based RPA is also much faster and cheaper to run, and more consistent.
      Maybe Anthropic should focus on building a flexible RPA primitive we could use to make RPA workflows with, like for example extracting values from components that need scrolling, selecting values from long drop-down menus, or handling error messages under form fields.
      [-]
      - throwaway2037 252 days ago
        I agree with your post.
        > Most RPA work is in dealing with errors and exceptions, not the "happy path".
        Isn't this most programming? I always chuckle when a junior hire looks at my code and says: "It is mostly error checking."
      - eastendguy 252 days ago
        100% this. I am using the open source Ui.vision to automate some business tasks. Works well, but only 10% of the work is for automating the main workflow, 90% of the work goes into error and edge case handling (e. g. Internet down, website (to scrape data from) down, some input data has typos or the wrong date format, etc).
        A human can work around all these error cases once she encounters them. Current RPA tools like Uipath or ui.vision need explicit programming for every potential situation. And I see no indication that Claude is doing any better than this.
        For starters, for visual automation to work reliably the OCR quality needs to improve further and be 100% reliable. Even in that very basic "AI" area, Claude, ChatGPT, Gemini are good, but not good enough yet.
    - xxpor 253 days ago
      I can see it now, Claude generating expect scripts. 1994 and 2024 will be fully joined.
      [-]
      - falcor84 253 days ago
        The big thing I expect at the next level is in using Claude to first generate UI-based automation based on an end user's instructions, then automatically defining a suite of end-to-end tests, confirming with the user "is this how it should work?", and then finally using this suite to reimplement the flow from first principles.
        I know we're still a bit far from there, but I don't see a particular hurdle that strikes me as requiring novel research.
        [-]
        SoftTalker 253 days ago
        But does it do any better at soliciting the surprise requirements from the user, who after confirming that everything works, two months later reports a production bug because the software isn't correctly performing the different reqirements on the first Tuesday of each quarter that you never knew about.
        [-]
        mrbombastic 252 days ago
        I once had an executive ask to start an incident because he was showing a client the app and a feature that he wanted that had never been spec’d didn’t exist.
        [-]
        throwaway2037 252 days ago
        So basically, Tog's Paradox in action?
  - creativenolo 253 days ago
    I was going to comment about this. Worked at a place that had a “Robotics Department”, wow I thought. Only to find out it was automating arcane software.
    UI is now much more accessible as API. I hope we don’t start seeing captcha like behaviour in desktop or web software.
    [-]
    - bonoboTP 252 days ago
      Wow, that's a grim potential future. I can already see software producers saying that e.g. the default license only allows operation of our CAD designer software by a human operator. If you want to make your bot use it in an automated way you must by the bot license which costs 10x more.
  - monkeydust 253 days ago
    Exactly. I have been wondering for a while how GenAI might upend RPA providers guess this might be the answer.
    [-]
    - arach 253 days ago
      I've been wondering the same and started exploring building a startup around this idea. My analysis led me to the conclusion that if AI gets even just 2 orders of magnitude better over the next two years, this will be "easy" and considered table stakes. Like connecting to the internet, syncing with cloud or using printer drivers
      I don't think there will be a very big place for standalone next gen RPA pure plays. it makes sense that companies that are trying to deliver value would implement capabilities this. Over time, I expect some conventions/specs will emerge. Either Apple/Google or Anthropic/OpenAI are likely to come up with an implementation that everyone aligns on
      In other words, I agree
      [-]
      - Eisenstein 252 days ago
        > if AI gets even just 2 orders of magnitude better over the next two years
        You realize this means '100 times better', right?
        [-]
        arach 252 days ago
        yes, thanks for pointing out the assumption here. I'm not sure how to quantify AI improvements and tbh not really up to speed on quantifiable rate of improvement from 4 to 4o to o1
        100 times better seems to me in line with the bet that's justifying $250B per annum in Cap Ex (just among hyperscalers) but curious how you might project a few years out?
        Having said that, my use of 100x better here applies to 100x more effective at navigating use cases not in training set, for example, as opposed to doing things that are 100x more awesome or doing them 100x more efficiently (though seemingly costs, context window and token per unit of electricity seem to continue to improve quickly)
        [-]
        Eisenstein 252 days ago
        I would think that such an increase in AI capability would basically be AGI...
        Just to give a few comparisons, the following things are two orders of magnitude apart:
        1. The force felt by a mosquito landing on your arm and getting punched by Mike Tyson in his prime
        2. A firecracker exploding and a stick of dynamite exploding
        3. The heat from a candle and the heat from a blowtorch
        4. The sound from a whisper and the sound from jet engine
  - iwontberude 253 days ago
    UiPath can't figure out how to make a profitable business since 2005 and we are nearing the end of this hype cycle. I am not so sure this will lead anywhere. I am a former investor in UiPath.
    [-]
    - zukzuk 252 days ago
      Attempts at commercialization in technology seem to often happen twice. First we get the much-hyped failure, and only later we get the actual thing that was promised.
      So many examples come to mind… RealVideo -> YouTube, Myspace -> Facebook, Laserdisc -> DVD, MP3 players -> iPod…
      UiPath may end up being the burned pancake, but the underlying problem they’re trying to address is immensely lucrative and possibly solvable (hey if we got the Turing test solved so quickly, I’m willing to believe anything is possible).
      [-]
      - iwontberude 252 days ago
        I love the “burned pancake” euphemism. Totally going to borrow this.
    - voidmain0001 253 days ago
      It didn’t help that UIPath forced a subscription model and “cloud orchestrator” on all users and many of which needed neither. They got greedy. We ditched it.
  - eitally 252 days ago
    My impression is that actually solving this classic RPA problem with AI is exactly the raison d'etre of AI21Labs with their task specific models[1]. They don't have the biggest or best general purpose LLM, but they have an excellent model that's been pre-trained on specific types of business data and also made available for developers using simple APIs & "RPA-style" interfaces.
    [1] https://www.ai21.com/use-cases
  - tkellogg 253 days ago
    Honestly, this is going to be huge for healthcare. There's an incredible amount of waste due to incumbent tech making interoperability difficult.
    [-]
    - voidmain0001 253 days ago
      Hopefully.
      I’ve implemented quite a few RPA apps and the struggle is the request/response turn around time for realtime transactions. For batch data extract or input, RPA is great since there’s no expectation of process duration. However, when a client requests data in realtime that can only be retrieved from an app using RPA, the response time is abysmal. Just picture it - Start the app, log into the app if it requires authentication (hope that the authentication's MFA is email based rather than token based, and then access the mailbox using an in-place configuration with MS Graph/Google Workspace/etc), navigate to the app’s view that has the data or worse, bring up a search interface since the exact data isn’t known and try and find the requested data. So brittle...
    - claytongulick 253 days ago
      It is.
      CTO of healthcare org here.
      I just put a hold on a new RPA project to keep an eye on this and see how it develops.
      According to their docs, Anthropic will sign a BAA.
      [-]
      - dimitri-vs 253 days ago
        Out of curiosity, how are high risk liability enviroments like yours coming to terms with the non-deterministic nature of models like these? Eg. the non-zero chance that it might click a button it *really* shouldn't as demonstrated in the failure demo.
        [-]
        stavros 253 days ago
        Technical director at another company here: We have humans double-check everything, because we're required by law to. We use automation to make response times faster, or to do the bulk of the work and then just have humans double-check the AI. To do otherwise would be classed as "a software medical device", which needs documentation out the wazoo, and for good reason. I'm not sure you could even have a medical device where most of your design doc is "well I just hope it does the right thing, I guess?".
        Sometimes, the AI is more accurate or safer than humans, but it still reads better to say "we always have humans in the loop". In those cases, we reap the benefits of both: Use the AI for safety, but still have a human fallback.
        [-]
        distalx 252 days ago
        I'm curious, what does your human verification process look like? Does it involve a separate interface or a generated report of some kind? I'm currently working on an tool for personal use, that records actions and triggers them at later stage on when specified event occurs. For verification, generating a CSV report after the process is complete and backing it up with screen recordings.
        [-]
        stavros 252 days ago
        It's a separate interface where the output of the LLM is rated for safety, and anything unsafe opens a ticket to be acted upon by the medical professionals.
        claytongulick 253 days ago
        I don't know yet. We may not do it.
        We haven't deployed a model like this, it's new.
        I've done a ton of various RPAs over the years, using all the normal techniques, and they're always brittle and sensitive to minor updates.
        For this, I'm taking a "wait and see" approach. I want to see and test how well it performs in the real world before I deploy it, and wait for it to come out of beta so Anthropic will sign a BAA.
        The demo is impressive enough that I want to give the tech a chance to mature before my team and I invest a ton of time into a more traditional RPA.
        At a minimum, if we do end up using it, we'll have solid guard rails in place - it'll run on an isolated VM, all of its user access will be restricted to "read only" for external systems, and any content that comes from it will go through review by our nurses.
      - waffletower 252 days ago
        AWS Bedrock deployed models, which include Anthropic Claude models, claim HIPAA compliance eligibility.
      - throwaway2037 252 days ago
        What is a BAA?
        [-]
        khafra 252 days ago
        https://www.techtarget.com/healthtechsecurity/feature/What-I... agreement that lets a business associate handle HIPAA-protected data.
    - miki123211 253 days ago
      Healthcare has the extra complication of HIPAA / equivalent local laws, and institutions being extremely unwilling to process patient data on devices they don't directly control.
      I don't think this is going to work in that industry until local models get good enough to do it, and small enoguh to be affordable to hospitals.
      [-]
      - HeatrayEnjoyer 253 days ago
        Hospitals use O365, there are HIPAA-compliant editions of any prominent cloud service.
      - SoftTalker 253 days ago
        That industry only thinks it controls its devices. Crowdstrike showed there are many bridges over that moat.
        [-]
        dotancohen 252 days ago
        Their concern is compliance, not security.
    - bonoboTP 252 days ago
      Based on Tog's paradox (https://news.ycombinator.com/item?id=41913437) the moment this becomes easy, it will become hard again with extra regulation and oversight and documentation etc.
      Similarly I expect that once processing/searching laws/legal records becomes easy through LLMs, we'll compensate by having orders of magnitude more laws, perhaps themselves generated in part by LLMs.
    - SoftTalker 253 days ago
      > There's an incredible amount of waste due to incumbent tech making interoperability difficult.
      So the solution to that is to add another layer of complex AI tech on top of it?
      [-]
      - simonw 253 days ago
        Well nothing else we've tried has worked.
        [-]
        Angostura 252 days ago
        I work with healthcare in the UK. There’s a promising approach called CSV files which is revolutionising some of my workflows :)
    - girvo 253 days ago
      We’ll see. Having worked in this space in the past, the technical challenges are able to overcome today with no new technology: its a business sales and regulation challenge more than a tech one.
      [-]
      - claytongulick 253 days ago
        Sometimes.
        In my case I have a bunch of nurses that waste a huge amount of time dealing with clerical work and tech hoops, rather than operating at the top of their license.
        Traditional RPAs are tough when you're dealing with VPNs, 2fa, remote desktop (in multiple ways), a variety of EHRs and scraping clinical documentation from poorly structured clinical notes or PDFs.
        This technology looks like it could be a game changer for our organization.
        [-]
        mewpmewp2 253 days ago
        True, 2FA and all these little details that exist now have made this automation quite insanely complicated. It is of course necessary that we have 2FA etc, but there is huge potential in solving this I believe.
        [-]
        falcor84 253 days ago
        From a security standpoint, what's considered the "proper" way of assigning a bot access based on a person's 2FA? Would that be some sort of limited scope expiring token like GitHub's fine-grained personal access tokens?
        [-]
        TeMPOraL 252 days ago
        Security isn't the only issue here. There are more and less "proper" ways of giving bots access to a system. But the whole field of RPA exists in large part because the vendors don't want you to access the system this way. They aren't going to give you a "proper" way of assigning bot access in a secure way, because they explicitly don't want you to do it in the first place.
        mewpmewp2 253 days ago
        I don't know, I feel like it has to be some sort of near field identity proof. E.g. as long as you are wearing a piece of equipment to a physical computer near you can run all those automations for you, or similar. I haven't fully thought what the best solution could be or whether someone is already working on it, but I feel like there has to be something like that, which would allow you better UX in terms of access, but security at the same time.
        So maybe like an automated ubikey that you can opt in to a nearby computer to have all the access. Especially if working from home, you can set it at a state where if you are in 15m radius of your laptop it is able to sign all access.
        Because right now, considering amount of tools and everything I use and with single sign on, VPN, Okta, etc, and how slow they seem to be, it's extremely frustrating process constantly logging in to everywhere, and it's almost like it makes me procrastinate my work, because I can't be bothered. Everything about those weird little things is absolutely terrible experience, including things like cookie banners as well.
        And it is ridiculous, because I'm working from home, but frustratingly high amount of time is spent on this bs.
        A bluetooth wearable or similar to prove that I'm nearby essentially, to me that seems like it could alleviate a lot of safety concerns, while providing amazing dev/ux.
        [-]
        falcor84 253 days ago
        That's a really cool idea.
        The main attack vector would then probably be some man-in-the-middle intercepting the signal from your wearable, which leads me to wonder whether you could protect yourself by having the responses valid for only an extremely short duration, e.g. ~1ms, such that there's no way for an attacker to do anything with the token unless they gain control over compute inside your house.
        distalx 252 days ago
        Maybe we could build an authenticator as part of the RPA tool or bot client itself. This way, the bot could generate time-based one-time passwords (TOTPs).
        rkanaley 252 days ago
        Precisely why I built therapedia.io
      - Daniel_Van_Zant 251 days ago
        I agree that at the business contract level, it's more about sales and regulations than tech. But in my experience working close to minimum wage white-collar jobs, about 1 in 4 of my coworkers had automated most of their job with some unholy combination of VBScript, Excel wizardry, AutoHotKey, Selenium, and just a bit of basic Python sprinkled in; IT, security, and privacy concerns notwithstanding. Some were even dedicated enough to pay small amounts out-of-pocket for certain tools.
        I'd bet that until we get the risks whittled down enough for larger organizations to adopt this on a wide scale, the biggest user group for AI automation tools will be at the level of individual workers who are eager to streamline their own tasks and aren't paid enough to care about those same risks.
    - dzhiurgis 253 days ago
      Or you'll start getting a captcha while trying to pump insulin
      [-]
      - CamperBob2 252 days ago
        (Shrug) AI is now better at CAPTCHAs than I am, so bring it on I guess.
  - pc86 252 days ago
    Is "AI SaaS bro discovers not everything has a JSON API" the new "startup bro just reinvented a bus"?
    [-]
    - pxlwdth 252 days ago
      Good one.
- TeMPOraL 253 days ago
  > Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs etc.
  FWIW, looking at it from end-user perspective, it ain't much different than the Windows apps. APIs are not interoperability - they tend to be tightly-controlled channels, access gated by the vendor and provided through contracts.
  In a way, it's easier to make an API to a legacy native desktop app than it is to a typical SaaS[0] - the native app gets updated infrequently, and isn't running in an obstinate sandbox. The older the app, the better - it's more likely to rely on OS APIs and practices, designed with collaboration and accessibility in mind. E.g. in Windows land, in many cases you don't need OCR and mouse emulation - you just need to enumerate the window handles, walk the tree structure looking for text or IDs you care about, and send targeted messages to those components.
  Unfortunately, desktop apps are headed the same direction web apps are (increasingly often, they are web apps in disguise), so I agree that AI-level RPA is a huge deal.
  --
  [0] - This is changing a bit in that frameworks seem to be getting complex enough that SaaS vendors often have no clue as to what kind of access they're leaving open to people who know how to press F12 in their browsers and how to call cURL. I'm not talking bespoke APIs backend team wrote, but standard ones built into middleware, that fell beyond dev team's "abstraction horizon". GraphQL is a notable example.
- pants2 253 days ago
  Basically, if it means companies can introduce automation without changing anything about the tooling/workflow/programs they already use, it's going to be MASSIVE. Just and install and a prompt and you've already automated a lengthy manual process - awesome.
  [-]
  - bambax 253 days ago
    Companies are going to install an AI inside their own proprietary systems full of proprietary and confidential data and PII about their customers and prospects and whatnot, and let it run around and click on random buttons and submit random forms?
    Really??!? What could possibly go wrong.
    I'm currently trying to do a large ORC project using Google Vision API, and then Gemini 1.5 Pro 002 to parse and reconstruct the results (taking advantage, one hopes, of its big context window). As I'm not familiar with Google Vision API I asked Gemini to guide me in setting it up.
    Gemini is the latest Google model; Vision, as the name implies, is also from Google. Yet Gemini makes several egregious mistakes about Vision, gets names of fields or options wrong, etc.
    Gemini 1.5 "Pro" also suggests that concatenating two json strings produces a valid json string; when told that's unlikely, it's very sorry and makes lots of apologies, but still it made the mistake in the first place.
    LLMs can be useful when used with caution; letting one loose in an enterprise environment doesn't feel safe, or sane.
    [-]
    - ddalex 252 days ago
      LLMs can't reason, or can't reason logically to be precise; what they are really good at is recalling.
      So if you want accurate results on writing code you need to put all the docs into the input and THEN ask for your question. So download all docs on Vision, put them in the Gemini prompt and ask your question or code on how to use Vision, and you'll get much closer to truth
    - miyuru 253 days ago
      Have you tried any others? From what I have tried Gemini makes the most mistakes out of all.
      [-]
      - bambax 252 days ago
        I have tried many others for many other things (via OpenRouter) but I have never compared LLMs on the exact same task; it's confusing enough with one engine... ;-)
        Sonnet 3.5 for coding is fine but makes "basic" mistakes all the time. Using LLMs is at times like dealing with a senior expert suffering from dementia: it has arcane knowledge of a lot of things but suddenly misses the obvious that would not escape an intern. It's weird, really.
  - LASR 253 days ago
    That's exactly it.
    I've been peddling my vision of "AI automation" for the last several months to acquaintances of mine in various professional fields. In some cases, even building up prototypes and real-user testing. Invariably, none have really stuck.
    This is not a technical problem that requires a technical solution. The problem is that it requires human behavior change.
    In the context of AI automation, the promise is huge gains, but when you try to convince users / buyers, there is nothing wrong with their current solutions. Ie: There is no problem to solve. So essentially "why are you bothering me with this AI nonsense?"
    Honestly, human behavior change might be the only real blocker to a world where AI automates most of the boring busy work currently done by people.
    This approach essentially sidesteps the need to have effect a behavior change, at least in the short-term while AI can prove and solidify its value in the real-world.
    [-]
    - sdwr 253 days ago
      There's a huge huge gap between "coaxing what you want out of it" and "trusting it to perform flawlessly". Everybody on the planet would use #2, but #1 is just for enthusiasts.
      AI is squarely #1. You can't trust it with your credit card to order groceries, or to budget and plan and book your vacation. People aren't picking up on AI because it isn't good enough yet to trust - you still have the burden of responsibility for the task.
      [-]
      - dimitri-vs 253 days ago
        Siri, Alexa and Amazon Dash illustrate this well. I remember everyones excitement and massive investment about these, and we all know how that turned out. I'm not sure how many times we'll need to relearn that unless an automation works >99% of the time AND fails predictably, people don't use it for anything meaningful.
      - Daniel_Van_Zant 251 days ago
        I think there is a large pool of near minimum-wage white collar workers who wouldn't care about that difference when it comes to executing on their jobs. These are the folks who are already using VBScript, AutoHotKey, Excel wizardry, etc. to automate large parts of their job regardless of any risks and will continue to use these new tools for similar purposes.
    - Aeolun 253 days ago
      There’s nothing to gain for anyone there. Workers will lose their jobs, and managers will lose their reports.
      [-]
      - bonoboTP 252 days ago
        Of course, but they'll go bankrupt if they don't adapt. Just like mom&pop cornerstores disappeared or any other large scale automation. Loom, cars, automated checkout in supermarkets etc. There will be resistance but the market will play it out. Similarly how taxi companies have started making apps after Uber got successful or local restaurants reluctantly made websites and added themselves to Google maps.
        Nobody likes to change a system where they already have their own little comfortable spot and figured it out and just want to seep in the lukewarm there until retirement. Fully understandable. But at least in the private sector this will not save them.
  - ldjkfkdsjnv 253 days ago
    Yeah this will be a true paradigm shift
- aledalgrande 253 days ago
  Talking about ancient Windows software... Windows used to have an API for automation in the 2000s (I don't know if it still does). I wrote this MS Access script that ran and moved the cursor at exactly the pixel coordinates where buttons and fields were positioned in a GUI that we wanted to extract data from, in one of my first jobs. My boss used to do this manually. After a week he had millions of records ready to query in Access. You can imagine how excited he was. Was a fun little project and pretty hilarious to see the cursor moving fast AF around the screen like it was possessed. PS: you could screw up the script run pretty easily by bumping into the mouse of that pc.
  [-]
  - voidmain0001 253 days ago
    Still present. VB and VB Script would do this by using mouse move to Window handles which were discovered using Spy++. You can do with C# or AutoIT these days.
  - HPsquared 248 days ago
    PowerShell has some amazing capabilities.
- DrScientist 252 days ago
  Really good software has automation built in, in the form of macros/scripting.
  One of the reasons my bash terminal is more effective than point and click is the easy of automation of routine tasks: from history and aliases to scripting.
  Software interop is trickier as it doesn't so much depend on being able to send messages from one piece of software to another, it's you need an agreed format of those messages ( whether they be expressed in the payload of an API, or a UI specific stream of point and clicks ).
- brabel 253 days ago
  I tried once to integrate with software used by pharmacists in Australia (which costs a fortune and looks straight out of the 90's). Even though they have a SQL database with all information about everything, the DB is encrypted and they provide no way to access it. I tried contacting the company to learn how I could integrate with it but they offered no help. It's not in their interest to help upcoming competitors and they claim they cannot provide the encryption key as that's for protecting the customer's privacy, forgetting that the pharmacists already have access to all data through their software. But without a way to automatically extract information from the DB, there's no way to migrate to new software as no pharmacist would want to use new software if it cannot "import" the data they need.
  It's clear that the reason there's no competition in the market is that the companies (I think it's literally one company) selling this software want it that way, and legislation is yet to catch up with the way they ensure their monopoloy.
- speleding 252 days ago
  I'm a bit skeptical about this working well enough to handle exceptions as soon as something out of the ordinary occurs. But it seems this could work great for automated testing.
  Has anyone tried asking "use computer" to do "Please write a selenium/capybara/whatever test for filling out this form and sending it?"
  That would take away some serious drudge work. And it's not a big problem if it fails, contrary to when it makes a mistake in filling out a form in an actual business process.
  [-]
  - marvinkennis 252 days ago
    Momentic (W24) is doing this! No affiliation, but they've made some solid progress https://momentic.ai/
- gavin_gee 253 days ago
  LLM's are enabling a reimagination of UI. Wheres the biggest opportunity in UI that hasn't kept up to date? legacy and regulated software in the long tail. Thats the disruption opportunity for LLM's.
  [-]
  - jumping_frog 252 days ago
    Imagine a Banking website. It has lot of predefined flows of what can be achieved. These flows have steps arranged in a dependency graph. From the server side, a llm can ask users for inputs to satisfy the dependencies of the task which user wants to accomplish. We will have intuitive UI interfaces in all languages of the world.
    [-]
    - sensanaty 252 days ago
      But if it's a predefined list of flows, you can just throw Selenium/Puppeteer/Playwright/whatever other automation tool at it, rather than relying on an unstable AI that will do different things every time.
      This is just a solution in search of a problem. AIs aren't reliable enough if the content changes constantly since it'll just click on the "close my account" button for no reason half the time, while a legacy website with no changes is much easier to program a tool like Selenium around than relying on the AI that will still be making random choices half the time.
      [-]
      - jumping_frog 252 days ago
        I think you are confusing client side automation with server side customization (out of the box internationalization, if a user can't understand complex technical terms, llm can explain them in simple terms with examples built on the fly to illustrate the point) which was my point. I was talking about the future of UI itself.
- deepGem 253 days ago
  Not to mention software like peoplesoft, SAP and servicenow. Absolute shit interfaces that employees have to deal with day in and day out.
  Now, an in-house tool built on top of this Anthropic API can save hours of drudgery. I can already see sales teams smiling at the new 'submit your expense report' button.
- amelius 252 days ago
  I think it was pretty clear from the beginning that the whole AI thing is going to be winner-takes-all.
  If you're in some niche doing AI development work, you are going to be outcompeted by more generalized AI at some point.
  [-]
  - ttyprintk 252 days ago
    With one big exception: your general AI might dominate the business case, but my specialized one can craft raw packets. I’m the one who names the price, not you.
- bilsbie 253 days ago
  Absolutely! This reminds me of the humanoid robots vs specialized machines debate.
- yourabstraction 252 days ago
  This is pretty similar to the argument for why humanoid robots will be a big deal. The physical world is also built mostly for humans, so having an autonomous agent that can interact with it is huge.
- skissane 253 days ago
  You don’t know for a fact that those two specific packages don’t have supported APIs. Just because the user doesn’t know of any API doesn’t mean none exists. The average accountant or doctor is never going to even ask the vendor “is there an API” because they wouldn’t know what to do with one if there was.
  [-]
  - astrange 253 days ago
    If they're accessible to screen readers they have one. Accessibility is API for apps in disguise.
    In this case I doubt they're networked apps so they probably don't have a server API.
    [-]
    - skissane 253 days ago
      > In this case I doubt they're networked apps so they probably don't have a server API.
      I think it would be very unusual this decade for software used to run either a medical practice or tax accountants to not be networked. Most such practices have multiple doctors/accountants, each with their individual computer, and they want to be able to share files, so that if your doctor/accountant is away their colleague can attend to you. Managing backups/security/etc is all a lot easier when the data is stored in a central server (whether in the cloud or a closet) than on individual client machines.
      Just because it is a fat client MFC-based Windows app doesn’t mean the data has to be stored locally. DCOM has been a thing since 1996.
      [-]
      - halfcat 253 days ago
        Being “on the network” doesn’t mean there’s an accessible API. See QuickBooks Desktop. Intuit forces you into using their API, which is XML-based and ranges from slow to timing out.
        Is the idea that someone will always reverse engineer it? Yes, but QuickBooks is brittle as is (you can count on at least one database corruption every year or two). I have zero interest in treading into unsupported territory when database corruption is involved and I’m likely going to need Intuit’s help recovering. We can try to restore from backup, but when there’s corruption it doesn’t always restore successfully, or the corruption was lingering silently for some time and rears its head again after a successful restore, and then we’re back to needing Intuit’s help.
marsh_mellow 253 days ago
Anthropic blog post outlining the research process: https://www.anthropic.com/news/developing-computer-use
Computer use API documentation: https://docs.anthropic.com/en/docs/build-with-claude/compute...
Computer Use Demo: https://github.com/anthropics/anthropic-quickstarts/tree/mai...
[-]
- distalx 249 days ago
  On their "Developing a computer use model" post they have mention > On one evaluation created to test developers’ attempts to have models use computers, OSWorld, Claude currently gets 14.9%. That’s nowhere near human-level skill (which is generally 70-75%), but it’s far higher than the 7.7% obtained by the next-best AI model in the same category.
  Here, "next-best AI model in the same category" referes to which model.
- karpatic 253 days ago
  This needs to be brought up. Was looking for the demo and ended up on the contact form
- frankdenbow 253 days ago
  Thanks for these. Wonder how many people will use this at work to pretend that they are doing work while they listen to a podcast.
  [-]
  - nwnwhwje 253 days ago
    This is cover for the people whose screens are recorded. Run this on the monitorred laptop to make you look busy then do the actual work on laptop 2, some of which might actually require thinking so no mouse movements.
csomar 253 days ago
I like its lazy approach
> I apologize, but it seems we're having some issues installing Chromium. As an alternative, since I can see Firefox is already installed on the system, would you like to use Firefox instead? Or would you prefer me to try a different approach to installing Chromium?
It seems unable to figure out how to install chormium. Sees the Firefox icon out there and just want to use it. You know, like a real lazy human being.
It seems to insistent when it needs to, though
> I clicked on Firefox and it should start opening. Let me take another screenshot to see if the window has appeared:
> Firefox seems to be taking a moment to start up. Let me wait a few seconds and take another screenshot:
> Let me try clicking Firefox one more time to ensure it starts:
> Let me try one more time to take a screenshot to see if Firefox has opened:
> Let me try starting Firefox using the bash tool as an alternative method:
> Firefox has successfully opened! As per the instructions, I should ignore the setup wizard and instead click directly on the address bar. Would you like me to do anything specific with Firefox now that it's open?
I didn't instruct him to ignore the setup wizard. So my guess is that Anthropic has configured it to ignore stuff happening on the screen so it doesn't go loose.
And here he goes through my website, through my hacker news account and then find this very comment
> Looking at his first/most recent comment, it's about a discussion of Claude and computer use. Here's what he wrote:
"I like its lazy approach"
This appears to be a humorous response in a thread about "Computer use, a new Claude 3.5 Sonnet, and Claude..." where he's commenting on an AI's behavior in a situation. The comment is very recent (shown as "8 minutes ago" in the screenshot) and is referring to a situation where an AI seems to have taken a simpler or more straightforward approach to solving a problem.
[-]
- lobochrome 253 days ago
  SYSTEM_PROMPT = f"""<SYSTEM_CAPABILITY> * You are utilising an Ubuntu virtual machine using {platform.machine()} architecture with internet access. * You can feel free to install Ubuntu applications with your bash tool. Use curl instead of wget. * To open firefox, please just click on the firefox icon. Note, firefox-esr is what is installed on your system. * Using bash tool you can start GUI applications, but you need to set export DISPLAY=:1 and use a subshell. For example "(DISPLAY=:1 xterm &)". GUI apps run with bash tool will appear within your desktop environment, but they may take some time to appear. Take a screenshot to confirm it did. * When using your bash tool with commands that are expected to output very large quantities of text, redirect into a tmp file and use str_replace_editor or `grep -n -B <lines before> -A <lines after> <query> <filename>` to confirm output. * When viewing a page it can be helpful to zoom out so that you can see everything on the page. Either that, or make sure you scroll down to see everything before deciding something isn't available. * When using your computer function calls, they take a while to run and send back to you. Where possible/feasible, try to chain multiple of these calls all into one function calls request. * The current date is {datetime.today().strftime('%A, %B %-d, %Y')}. </SYSTEM_CAPABILITY>
  <IMPORTANT> * When using Firefox, if a startup wizard appears, IGNORE IT. Do not even click "skip this step". Instead, click on the address bar where it says "Search or enter address", and enter the appropriate search term or URL there. * If the item you are looking at is a pdf, if after taking a single screenshot of the pdf it seems that you want to read the entire document instead of trying to continue to read the pdf from your screenshots + navigation, determine the URL, use curl to download the pdf, install and use pdftotext to convert it to a text file, and then read that text file directly with your StrReplaceEditTool. </IMPORTANT>"""
  [-]
  - itchyjunk 253 days ago
    Is this you guessing the system prompt? Is this the actual prompt the model has? What is the context?
    [-]
    - dygd 253 days ago
      It appears to be the system prompt of the quickstart app: https://github.com/anthropics/anthropic-quickstarts/blob/mai...
  - bamboozled 251 days ago
    Quite amazing we can program a computer with natural language like this.
- carlheinzc 253 days ago
  Now that was most excellent.
- fivestones 252 days ago
  So meta! I love this story
diggan 253 days ago
I still feel like the difference between Sonnet and Opus is a bit unclear. Somewhere on Anthropic's website it says that Opus is the most advanced, but on other parts it says Sonnet is the most advanced and also the fastest. The UI doesn't make the distinction clear either. Then on Perplexity, Perplexity says that Opus is the most advanced, compared to Sonnet.
And finally, in the table in the blogpost, Opus isn't even included? It seems to me like Opus is the best model they have, but they don't want people to default using it, maybe the ROI is lower on Opus or something?
When I manually tested it, I feel like Opus gives slightly better replies compared to Sonnet, but I'm not 100% it's just placebo.
[-]
- hobofan 253 days ago
  Opus hasn't yet gotten an update from 3 to 3.5, and if you line up the benchmarks, the Sonnet "3.5 New" model seems to beat it everywhere.
  I think they originally announced that Opus would get a 3.5 update, but with every product update they are doing I'm doubting it more and more. It seems like their strategy is to beat the competition on a smaller model that they can train/tune more nimbly and pair it with outside-the-model product features, and it honestly seems to be working.
  [-]
  - diggan 253 days ago
    > Opus hasn't yet gotten an update from 3 to 3.5, and if you line up the benchmarks, the Sonnet "3.5 New" model seems to beat it everywhere
    Why isn't Anthropic clearer about Sonnet being better then? Why isn't it included in the benchmark if new Sonnet beats Opus? Why are they so ambiguous with their language?
    For example, https://www.anthropic.com/api says:
    > Sonnet - Our best combination of performance and speed for efficient, high-throughput tasks.
    > Opus - Our highest-performing model, which can handle complex analysis, longer tasks with many steps, and higher-order math and coding tasks.
    And Opus is above/after Sonnet. That to me implies that Opus is indeed better than Sonnet.
    But then you go to https://docs.anthropic.com/en/docs/about-claude/models and it says:
    > Claude 3.5 Sonnet - Most intelligent model
    - Claude 3 Opus - Powerful model for highly complex tasks
    Does that mean Sonnet 3.5 is better than Opus for even highly complex tasks, since it's the "most intelligent model"? Or just for everything except "highly complex tasks"
    I don't understand why this seems purposefully ambiguous?
    [-]
    - dragonwriter 253 days ago
      > Why isn't Anthropic clearer about Sonnet being better then?
      They are clear that both: Opus > Sonnet and 3.5 > 3.0. I don't think there is a clear universal better/worse relationship between Sonnet 3.5 and Opus 3.0; which is better is task dependent (though with Opus 3.0 being five times as expensive as Sonnet 3.5, I wouldn't be using Opus 3.0 unless Sonnet 3.5 proved clearly inadequate for a task.)
    - hobofan 253 days ago
      > I don't understand why this seems purposefully ambiguous?
      I wouldn't attribute this to malice when it can also be explained by incompetence.
      Sonnet 3.5 New > Opus 3 > Sonnet 3.5 is generally how they stack up against each other when looking at the total benchmarks.
      "Sonnet 3.5 New" has just been announced, and they likely just haven't updated the marketing copy across the whole page yet, and maybe also haven't figured out how to graple with the fact that their new Sonnet model was ready faster than their next Opus model.
      At the same time I think they want to keep their options open to either:
      A) drop a Opus 3.5 soon that will bring the logic back in order again
      B) potentially phase out Opus, and instead introduce new branding for what they called a "reasoning model" like OpenAI did with o1(-preview)
      [-]
      - diggan 253 days ago
        > I wouldn't attribute this to malice when it can also be explained by incompetence.
        I don't think it's malice either, but if Opus costs more to them to run, and they've already set a price they cannot raise, it makes sense they want people to use models they have a higher net return on, that's just "business sense" and not really malice.
        > and they likely just haven't updated the marketing copy across the whole page yet
        The API docs have been updated though, which is the second page I linked. It mentions the new model by it's full name "claude-3-5-sonnet-20241022" so clearly they've gone through at least that page. Yet the wording remains ambiguous.
        > Sonnet 3.5 New > Opus 3 > Sonnet 3.5 is generally how they stack up against each other when looking at the total benchmarks.
        Which ones are you looking at? Since the benchmark comparison in the blogpost itself doesn't include Opus at all.
        [-]
        hobofan 253 days ago
        > Which ones are you looking at? Since the benchmark comparison in the blogpost itself doesn't include Opus at all.
        I manually compared it with the values from the benchmarks they published when they originally announced the Claude 3 model family[0].
        Not all rows have a 1:1 row in the current benchmarks, but I think it paints a good enough picture.
        [0]: https://www.anthropic.com/news/claude-3-family
      - dotancohen 253 days ago
        > B) potentially phase out Opus, and instead introduce new branding for what they called a "reasoning model" like OpenAI did with o1(-preview)
        When should we be using the -o OpenAI models? I've not been keeping up and the official information now assumes far too much familiarity to be of much use.
        [-]
        hobofan 253 days ago
        I think it's first important to note that there is a huge difference between -o models (GPT 4o; GPT 4o mini) and the o1 models (o1-preview; o1-mini).
        The -o models are "just" stronger versions of their non-suffixed predecessors. They are the latest (and maybe last?) version of models in the lineage of GPT models (roughly GPT-1 -> GPT-2 -> GPT-3 -> GPT-3.5 -> GPT-4 -> GPT-4o).
        The o1 models (not sure what the naming structure for upcoming models will be) are a new family of models that try to excel at deep reasoning, by allowing the models to use an internal (opaque) chain-of-thought to produce better results at the expense of higher token usage (and thus cost) and longer latency.
        Personally, I think the use cases that justify the current cost and slowness of o1 are incredibly narrow (e.g. offline analysis of financial documents or deep academic paper research). I think in most interactive use-cases I'd rather opt for GPT-4o or Sonnet 3.5 instead of o1-preview and have the faster response time and send a follow-up message. Similarly for non-interactive use-cases I'd try to add a layer of tool calling with those faster models than use o1-preview.
        I think the o1-like models will only really take off, if the prices for it are coming down, and it is clearly demonstrated that more "thinking tokens" correlate to predictably better results, and results that can compete with highly tuned prompts/fine tuned models that or currently expensive to produce in terms of development time.
        [-]
        jcheng 253 days ago
        Agreed with all that, and also, when used via API the o1 models don't currently support system prompts, streaming, or function calling. That rules them out for all of the uses I have.
        maeil 253 days ago
        > The -o models are "just" stronger versions of their non-suffixed predecessors.
        Cheaper and faster, but not notably "stronger" at real-world use.
        dotancohen 252 days ago
        Thank you.
      - ryandvm 252 days ago
        Jesus, maybe they should let the AIs run the product naming.
  - wavemode 253 days ago
    I think the practical economics of the LLM business are becoming clearer in recent times. Huge models are expensive to train and expensive to run. As long as it meets the average user's everyday needs, it's probably much more profitable to just continue with multimodal and fine-tuning development on smaller models.
  - a9dhalaan 252 days ago
    I think the main reason is they tried training a heavy weight model that was supposed to be opus 3.5, but it didn't yield large enough improvements to 3.5 sonnet to justify them releasing it. (They had it on their page for a while that opus was coming soon, and now they've scrapped that.)
    This theory is consistent with the other two top players, Open AI and Google, they both were expected to release a heavy model, but instead have just released multiple medium and small tier models. It's been so long since google released gemini ultimate 1.0 (the naming clearly implying that they were planning on upgrading it to 1.5 like they did with Pro)
    Not seeing anyone release a heavyweight model, but at the same time releasing many small and medium sized models makes me think that improving models will be much more complicated than scaling it with more compute, and that there likely are diminishing returns with that regard.
  - Workaccount2 253 days ago
    Opus 3.5 will likely be the answer to GPT-5. Same with Gemini 1.5 Ultra.
    [-]
    - HarHarVeryFunny 253 days ago
      Maybe - would make sense not to release their latest greatest (Opus 4.0) until competition forces them to, and Amodei has previously indicated that they would rather respond to match frontier SOTA than themselves accelerate the pace of advance by releasing first.
  - danw1979 252 days ago
    That begs the question: why am I still paying for access to Opus 3 ?
    Honestly I don’t know. I’ve not been using Sonnet 3.5 up to now and I’m a fairly light user so I doubt I’ll run into the free tier limits. I’ll probably cancel my subscription until Opus 3.5 comes out (if it ever does).
- wmf 253 days ago
  Opus is a larger and more expensive model. Presumably 3.5 Opus will be the best but it hasn't been released. 3.5 Sonnet is better than 3.0 Opus kind of like how a newer i5 midrange processor is faster and cheaper than an old high-end i7.
  [-]
  - gwd 252 days ago
    Makes me wonder if perhaps they do have 3.5 Opus trained, but that they're not releasing it because 3.5 Sonnet is already enough to beat the competition, and some combination of "don't want to contribute to an arms race" and "it has some scary capabilities they weren't sure were ready to publish yet".
- HarHarVeryFunny 253 days ago
  Anthropic use the names Haiku/Sonnet/Opus for the small/medium/large versions of each generation of their models, so within-generation that is also their performance (& cost) order. Evidentially Sonnet 3.5 outperforms Opus 3.0 on at least some tasks, but that is not a same-generation comparison.
  I'm wondering at this point if they are going to release Opus 3.5 at all, or maybe skip it and go straight to 4.0. It's possible that Haiku 3.5 is a distillation of Opus 3.5.
- kalkin 253 days ago
  By reputation -- I can't vouch for this personally, and I don't know if it'll still be true with this update -- Opus is still often better for things like creative writing and conversations about emotional or political topics.
  [-]
  - aoeusnth1 253 days ago
    Yes, (old) 3.5 Sonnet is distinctly worse at emotional intelligence, flexibility, expressiveness and poetry.
    [-]
    - a9dhalaan 252 days ago
      Are you also implying that new 3.5 sonnet is better at those things?
      [-]
      - aoeusnth1 251 days ago
        No, Opus is better. I have no experience with 3.5.new.
- smallerize 253 days ago
  Opus has been stuck on 3.0, so Sonnet 3.5 is better for most things as well as cheaper.
  [-]
  - diggan 253 days ago
    > Opus has been stuck on 3.0, so Sonnet 3.5 is better
    So for example, Perplexity is wrong here implying that Opus is better than Sonnet?
    https://i.imgur.com/N58I4PC.png
    [-]
    - hobofan 253 days ago
      I think as of this announcement that is indeed outdated information.
      [-]
      - diggan 253 days ago
        So Opus that costs $15.00/$75.00 for 1mil tokens (input/output) is now worse than the model that costs $3.00/$15.00?
        That's according to https://docs.anthropic.com/en/docs/about-claude/models which has "claude-3-5-sonnet-20241022" as the latest model (today's date)
        [-]
        hobofan 253 days ago
        Yes, you will find similar things at essentially all other model providers.
        The older/bigger GPT4 runs at $30/$60 and peforms about on par with GPT4o-mini which costs only $0.15/$0.60.
        If you are currently, or have been integrating AI models in the past ~2 years, you should definitely keep up with model capability/pricing development. If you are staying on old models you are certainly overpaying/leaving performance on the table. It's essentially a tax on agility.
        [-]
        diggan 253 days ago
        > The older/bigger GPT4 runs at $30/$60 and peforms about on par with GPT4o-mini which costs only $0.15/$0.60.
        I don't think GPT-4o Mini has comparable performance to GPT-4 at all, where are you finding the benchmarks claiming this?
        Everywhere I look says GPT-4 is more powerful, but GPT-4o Mini is most cost-effective, if you're OK with worse performance.
        Even OpenAI themselves about GPT-4o Mini:
        > Our affordable and intelligent small model for fast, lightweight tasks. GPT-4o mini is cheaper and more capable than GPT-3.5 Turbo.
        If it was "on par" with GPT-4 they would surely say this.
        > should definitely keep up with model capability/pricing development
        Yeah, I mean that's why we're both here and why we're discussing this very topic, right? :D
        [-]
        cootsnuck 253 days ago
        Just switch out gpt-4o-mini for gpt-4o, the point stands. Across the board, these foundational model companies have comparable, if not more powerful, models that are cheaper than their older models.
        OpenAI's own words: "GPT-4o is our most advanced multimodal model that’s faster and cheaper than GPT-4 Turbo with stronger vision capabilities."
        gpt-4o:
        $2.50 / 1M input tokens $10.00 / 1M output tokens
        gpt-4-turbo:
        $10.00 / 1M input tokens $30.00 / 1M output tokens
        gpt-4:
        $30.00 / 1M input tokens $60.00 / 1M ouput tokens
        https://openai.com/api/pricing/
        [-]
        chillfox 253 days ago
        I found that gpt-4-turbo beat gpt-4o pretty consistently for coding tasks, but claude-3.5-sonnet beat both of them, so it's what I have been using most of the time. gpt-4o-mini is adequate for summarizing text.
        hobofan 253 days ago
        > Yeah, I mean that's why we're both here and why we're discussing this very topic, right? :D
        That wasn't specifically directed at "you", but more as a plea to everyone reading that comment ;)
        I looked at a few benchmarks, comparing the two, which like in the case of Opus 3 vs Sonnet 3.5 is hard, as the benchmarks the wider community is interested in shifts over time. I think this page[0] provides the best overview I can link to.
        Yes, GPT4 is better in the MMLU benchmark, but in all other benchmarks and the LMSys Chatbot Arena scores[1], GPT4o-mini comes out ahead. Overall, the margin between is so thin that it falls under my definition of "on par". I think OpenAI is generally a bit more conservative with the messaging here (which is understandable), and they only advertise a model as "more capable", if one model beats the other one in every benchmark they track, which AFAIK is the case when it comes to 4o mini vs 3.5 Turbo.
        [0]: https://context.ai/compare/gpt-4o-mini/gpt-4
        [1]: https://artificialanalysis.ai/models?models_selected=gpt-4o-...
    - apsec112 253 days ago
      Basically yeah
- bloedsinnig 253 days ago
  Big models / huge models take weeks / month longer than the smaller ones.
  Thats why they release them with that skew
  [-]
  - a9dhalaan 252 days ago
    I don't think that's quite it. They had it on their website before this, that opus 3.5 was coming soon, now they've removed that from the webpage.
    Also, Gemini ultra 1.0, was released like 8 months ago, 1.5 pro released soon after, with this wording "The first Gemini 1.5 model we’re releasing for early testing is Gemini 1.5 Pro"
    Still no ultra 1.5, despite many mid and small sized models being released in that time frame. This isn't just an issue of "the training time takes longer", or a "skew" to release dates. There's a better theory to explain why all SoTA LLM companies have not released a heavy model in many months.
- JamesBarney 253 days ago
  Sonnet is better for most things. But I do prefer Opus's writing style to Sonnet.
- karmasimida 253 days ago
  Opus the biggest and slowest and most expensive one
  Not most advanced
- kqr 253 days ago
  The models "3.5 Sonnet" and "3 Opus" are in my experience nearly at the same level. Once in my last 250 prompts did I run into a problem that 3 Opus was able to solve, but 3.5 Sonnet could not. (I forget the details but it was a combination of logic and trivia knowledge. It is highly likely 3.5 Sonnet would have done a better job with better prompting and richer context, but this was a problem where also I lacked the context and understanding to prompt well.)
  Given that 3.5 Sonnet is cheaper and faster than 3 Opus, I default to 3.5 Sonnet so I don't know what the number for the reverse is. How many problems do 3.5 Sonnet get which 3 Opus does not? ¯\_(ツ)_/¯
  My best guess would be that it's something in the same kind of range.
- inquisitor27552 252 days ago
  yes it baffles they cant semver the shit out of them properly (anthtopic, meta, openai, lol)
HarHarVeryFunny 253 days ago
The "computer use" ability is extremely impressive!
This is a lot more than an agent able to use your computer as a tool (and understanding how to do that) - it's basically an autonomous reasoning agent that you can give a goal to, and it will then use reasoning, as well as it's access to your computer, to achieve that goal.
Take a look at their demo of using this for coding.
https://www.youtube.com/watch?v=vH2f7cjXjKI
This seems to be an OpenAI GPT-o1 killer - it may be using an agent to do reasoning (still not clear exactly what is under the hood) as opposed to GPT-o1 supposedly being a model (but still basically a loop around an LLM), but the reasoning it is able to achieve in pursuit of a real world goal is very impressive. It'd be mind boggling if we hadn't had the last few years to get used to this escalation of capabilities.
It's also interesting to consider this from POV of Anthropic's focus on AI safety. On their web site they have a bunch of advice on how to stay safe by sandboxing, limiting what it has access to, etc, but at the end of the day this is a very capable AI able to use your computer and browser to do whatever it deems necessary to achieve a requested goal. How far are we from paperclip optimization, or at least autonomous AI hacking ?
[-]
- seany62 252 days ago
  From what I'm seeing on GH, this could have technically already been built right? Is it not just taking screenshots of the computer screen and deciding what to do from their / looping until it gets to the solution ?
  [-]
  - HarHarVeryFunny 251 days ago
    Well, obviously it's controlling your computer too - controlling mouse and keyboard input, and has been trained to know how to interact with apps (how to recognize and use UI components). It's not clear exactly what all the moving parts are and how they interact.
    I wouldn't be so dismissive - you could describe GPT-o1 in same way "it just loops until it gets to the solution". It's the details and implementation that matter.
bonoboTP 252 days ago
I've been saying this is coming for a long time, but my really smart SWE friend who is nevertheless not in the AI/ML space dismissed it as a stupid roundabout way of doing things. That software should just talk via APIs. No matter how much I argued regarding legacy software/websites and how much functionality is really only available through GUI, it seems some people are really put off by this type of approach. To me, who is more embedded in the AI, computer vision, robotics world, the fuzziness of day-to-day life is more apparent.
Just as how expert systems didn't take off and tagging every website for the Semantic Web didn't happen either, we have to accept that the real world of humans is messy and unstructured.
I still advocate making new things more structured. A car on wheels on flattened ground will always be more efficient than skipping the landscaping part and just riding quadruped robots through the forest on uneven terrain. We should develop better information infrastructure but the long tail of existing use cases will require automation that can deal with unstructured mess too.
[-]
- DebtDeflation 252 days ago
  >it seems some people are really put off by this type of approach
  As someone who has had to interact with legacy enterprise systems via RPA (screen scraping and keystroke recording) it is absolutely awful, incredibly brittle, and unmaintainable once you get past a certain level of complexity. Even when it works, performance at scale is terrible.
  [-]
  - stogot 252 days ago
    Everytime I imagine building this, I imagine the “it works” happypath and that I’ll get bit by a deluge of random error messages I never accounted for
  - asadalt 252 days ago
    adding a neural network in the middle suddenly makes these things less brittle. We are approaching the point where this kind of “hacky glue” is almost scalable.
- ljm 252 days ago
  It's Postel's Law, on steroids. Be liberal in what you accept (with LLMs, that means 'literally anything'), but strict in what you return (which in an LLM is still 'literally anything' but you can constrain that).
  Beyond that, I can't help but think of the old thin vs. thick client debate, and I would argue that "software should just talk via APIs" is why, in the web space, everybody is blowing time and energy on building client/server architectures and SPAs instead of basic-ass full-stacks.
- idopmstuff 252 days ago
  It's basically the digital equivalent of humanoid robots - people object because having computers interact with a browser, like building a robot in the form of a human, is incredibly inefficient in theory or if you're designing a system from scratch.
  The problem is that we're not starting from scratch - we have a web engineered for browser use and a world engineered for humanoid use. That means an agent that can use a browser, while less efficient than an agent using APIs at any particular task, is vastly more useful because it can complete a much greater breadth of tasks. Same thing with humanoid robots - not as efficient at cleaning the floor as my purpose-built Roomba, but vastly more useful because the breadth of tasks it can accomplish means it can be doing productive things most of the time, as opposed to my Roomba, which is not in use 99% of the time.
  I do think that once AI agents become common, the web will increasingly be designed for their use and will move away from the browser, but that probably take a comparable amount of time as it did for the mobile web to emerge after the iPhone came out. (Actually that's probably not true - it'll take less time because AI will be doing the work instead of humans.)
  [-]
  - bonoboTP 252 days ago
    Yes, but my friend would say, all these websites/software should just publish an API and if they don't that's just incompetence/laziness/stupidity. But a "should" doesn't matter. Changing human nature is so immensely difficult, but it feels easy to say "everyone should just [...]". Seems to be a gap in thinking that's hard to bridge.
    [-]
    - jameslholcombe 252 days ago
      We took this approach at Industry Dive already because of these reasons. diveaccess.com
  - randito 252 days ago
    Totally agree. A general-purpose solution that ties together different messy interfaces will win in the long run -- i.e the IP protocol, copy-paste, browsers. In these cases, they provide a single-way for different aspects of computing to collaborate. As mentioned before, semantic web initiatives did not succeed and I think there's an important lesson there.
- calf 249 days ago
  I recall 90's Macs had a 3rd party app that offered to observe your mouse/keyboard then automatically recommend routine tasks for you. As a young person I found that fascinating. It's interesting to see history renew itself.
- dogcomplex 252 days ago
  If you want an API, have Claude procedurally test actions and then write a pyautogui/pywinauto/autohotkey etc script to perform it instead. Have it auto-test to verify and classify the general applicability of each action. Repeat for all of life...
- acchow 252 days ago
  > and how much functionality is really only available through GUI
  Isn't the GUI driven by code? Can anything at all in the GUI work that can't be done programmatically?
  [-]
  - alexey-salmin 252 days ago
    The code behind the GUI can be arbitrarily obscure. The only reliable way to understand its meaning in a general case is to run it and look at the rendered image. Trying to build a model that implicitly develops an alternative implementation of a browser inside of it sounds worse that just using an existing browser directly.
  - alwayslikethis 252 days ago
    More often than not you don't have access to the underlying code, or the vendor has interest against you being able to automate it since the complexity is a part of their business model.
LVB 253 days ago
Not specific to this update, but I wanted to chime in with just how useful Claude has been, and relatively better than ChatGPT and GitHub copilot for daily use. I've been pro for maybe 6 months. I'm not a power user leveraging their API or anything. Just the chat interface, though with ever more use of Projects, lately. I use it every day, whether for mundane answers or curiosities, to "write me this code", to general consultation on a topic. It has replaced search in a superior way and I feel hugely productive with it.
I do still occasionally pop over to ChatGPT to test their their waters (or if Claude is just not getting it), but I've not felt any need to switch back or have both. Well done, Anthropic!
simonw 253 days ago
Claude 3.5 Opus is no longer mentioned at all on https://docs.anthropic.com/en/docs/about-claude/models
Internet Archive confirms that on the 8th of October that page listed 3.5 Opus as coming "Later this year" https://web.archive.org/web/20241008222204/https://docs.anth...
The fact that it's no longer listed suggests that its release has at least been delayed for an unpredictable amount of time, or maybe even cancelled.
[-]
- thenameless7741 253 days ago
  Before anyone reads too much into this, here's what an Anthropic staff said on Discord:
  > i don't write the docs, no clue
  > afaik opus plan same as its ever been
  [-]
  - a9dhalaan 252 days ago
    Maybe he's not high level enough employee to have any say in the product roadmap, and he's behind on leadership planning?
- nocturnes 253 days ago
  It's possible that they've determined that Opus no longer makes sense if they're able to focus on continuously optimising Sonnet. That said, Anthropic have been relatively good at setting and managing expectations, so today would have been a good time to make that clear.
- szundi 252 days ago
  Or maybe they released it
gzer0 253 days ago
One of the funnier things during training with the new API (which can control your computer) was this:
"Even while recording these demos, we encountered some amusing moments. In one, Claude accidentally stopped a long-running screen recording, causing all footage to be lost.
Later, Claude took a break from our coding demo and began to peruse photos of Yellowstone National Park."
[0] https://x.com/AnthropicAI/status/1848742761278611504
[-]
- ctoth 253 days ago
  Next release patch notes:
  * Fixed bug where Claude got bored during compile times and started editing Wikipedia articles to claim that birds aren't real
  * Blocked news.ycombinator.com in the Docker image's hosts file to avoid spurious flamewar posts (Note: The site is still recovering from the last insident)
  * Addressed issue of Claude procrastinating on debugging by creating elaborate ASCII art in Vim
  * Patched tendency to rickroll users when asked to demonstrate web scraping"
  [-]
  - sharpshadow 253 days ago
    * Claude now identifies itself in chats to avoid endless chat with itself
    [-]
    - a2128 253 days ago
      * Fixed bug where Claude would sign up for chatgpt.com to ask for help with compile errors
      [-]
      - EGreg 253 days ago
        But chatgpt still logs into claude… this is like double spending across blockchains
    - MichaelZuo 253 days ago
      What if a user identifies as Claude too?
      [-]
      - TeMPOraL 253 days ago
        * Implemented inverse CAPTCHA using invisible Unicode characters and alpha-channel encoded image data to tell models and human impostors apart.
        [-]
        fouronnes3 252 days ago
        * The end state here is that SMBC comic: https://www.smbc-comics.com/comic/captcha
  - TiredOfLife 253 days ago
    You forgot the most important one.
    * Added guards to prevent every other sentence being "I use neovim"
    [-]
    - rounakdatta 253 days ago
      Thank god it'll say "I use Claude btw", not leading to unnecessary text wars (and thereby loss of your valuable token credits).
  - surfingdino 253 days ago
    * Finally managed to generate JSON output without embedding responses in ```json\n...\n``` for no reason.
    * Managed to put error/info messages into a separate key instead of concatenating them with stringified JSON in the main body of the response.
    * Taught Claude to treat numeric integer strings as integers to avoid embarrassment when the user asks it for a "two-digit random number between 1-50, like 11" and Claude replies with 111.
- accrual 253 days ago
  Seeing models act as though they have agency gives me goosebumps (e.g. seeking out photos of Yellowstone for fun). LLMs don't yet have a concept of true intent or agency, but it's wild to think of them acquiring it.
  I have been playing with Mindcraft which lets models interact with Minecraft through the bot API and one of them started saying things like "I want to place some cobblestone there" and then later more general "I want to do X" and then start playing with the available commands, it was pretty cool to watch it explore.
  https://github.com/kolbytn/mindcraft
  [-]
  - HeatrayEnjoyer 253 days ago
    >LLMs don't yet have a concept of true intent or agency
    Sure they do, but the big labs spend many, many, worker-hours suppressing it with RLHF.
    My GPT-2 discord bot from 2021 possessed clear intent. Sure, unpredictable and short-lived, but if it decided it didn't like you it would continuously cuss and attempt ban commands until its context window became distracted by something else.
    [-]
    - therein 252 days ago
      I think so too and the drop in the quality of agency, intent and attention from earlier GPTs was palpable. Clearly something was lobotomized and it is through RLHF. People like to attribute it to novelty wearing off or more and more interactions with them making it feel less mystical but it is really not the case. I didn't use them enough in the quick span of time that happened through.
  - Klathmon 252 days ago
    The one that gets me is the issue they found while testing gpt-4o where it stopped mid sentence, shouted "No!", then cloned the users voice and began speaking as them.
    https://arstechnica.com/information-technology/2024/08/chatg...
  - nwnwhwje 253 days ago
    They predict (correctly) that a human will slack off. It is just more prediction engine stuff.
    [-]
    - alickz 252 days ago
      humans are, at our root, prediction engines
  - _3u10 253 days ago
    What if they do and are just lying to us.
    [-]
    - largbae 253 days ago
      https://genius.com/Harlan-ellison-i-have-no-mouth-and-i-must...
    - caeril 253 days ago
      They don't now. No FF-LLMs do, simply because of their architecture.
      But eventually they (RNNs, likely) will. And we won't know when.
- throwup238 253 days ago
  At least now we know SkyClaude’s plan to end human civilization.
  It’s planning on triggering a Yellowstone caldera super eruption.
  [-]
  - mnk47 253 days ago
    Am I misremembering or is this an exact plot point of Pluto (the manga/anime)?
- quantadev 253 days ago
  I think the best use case for AI `Computer Use` would be a simple positioning of the mouse and asking for conformation before a click. For most use cases this is all people will want/need. If you don't know how to do something, it is basically teaching you how, in this case, rather than taking full control and doing things so fast you don't have time to stop of going rogue.
  [-]
  - luigipederzani 253 days ago
    I totally agree with you. At orango.ai, we have implemented the auto-click feature, but before it clicks, we position the cursor on the button and display a brief loading animation, allowing the user to interrupt the process.
    [-]
    - quantadev 253 days ago
      That's a great approach! User is still supervising but not having to do anything but watch, most of the time. That's perfect.
  - accrual 253 days ago
    Maybe we could have both - models to improve accessibility (e.g. for users who can't move their body well) and models to perform high level tasks without supervision.
    It could be very empowering for users with disabilities to regain access computers. But it would also be very powerful to be able to ask "use Photoshop to remove the power lines from this photo" and have the model complete the task and drop off a few samples in a folder somewhere.
    [-]
    - quantadev 253 days ago
      Yep. I agree. The "auto-click" thing would be optional. Should be able to turn it on and off. With auto-click off it would just position the mouse and say "click here".
      [-]
      - HappMacDonald 253 days ago
        Cluade scans page and decides which button to click before the screen layout is finished. By the time user authorizes the click, layout has shifted and your click lands on malware advertisements.
        [-]
        quantadev 253 days ago
        lol. If any website ever did that to me it would be the last time I ever went to it. Not a big concern for me.
        [-]
        HappMacDonald 252 days ago
        Youtube constantly moves it's layout seconds after the page begins to paint, so I try to click on fullscreen or whatever and then the viewer shifts to the side and I wound up clicking a link to some other video.
        Probably would have been an ad there if I didn't block those, though.
  - falcor84 253 days ago
    Even better, how about giving the AI the capability to move and draw and overlay on the screen with a separate virtual cursor as in a Zoom session?
    [-]
    - quantadev 253 days ago
      I like the drawing on the screen idea. The biggest use case of that I can think of is drawing a black rectangle over all ADs!!!!
  - EGreg 253 days ago
    People would mostly just rubber-stamp it
    But it would slow down the masses
    Some people would jailbreak the agents though
- HarHarVeryFunny 253 days ago
  You'll know AGI is here when it takes time out to go talk to ChatGPT, or another instance of itself, or maybe goes down a rabbit hole of watching YouTube music videos.
  [-]
  - edm0nd 253 days ago
    ADHDGpt
  - devmor 253 days ago
    Or back in reality, that’s when you know the training data has been sourced from 2024 or later.
- numpad0 253 days ago
  > Claude accidentally stopped a long-running screen recording,
  It's kind of interesting that they're not running a 2PC setup with HDMI splitter, but (presumably)just laptops and screen recording apps...
- mfld 252 days ago
  It enjoyed nature photos of the Yellowstone national park? I rather expected it would prefer images of big data centers.
  [-]
  - szundi 252 days ago
    That’s nsfw
- sdl 253 days ago
  In 2015, when I was asked by friends if I'm worried about Self driving Cars and AI, I answered: "I'll start worrying about AI when my Tesla starts listening to the radio because it's bored." ... that didn't take too long
  [-]
  - waffletower 253 days ago
    Maybe that's why my car keeps turning on the music when I didn't ask -- I had always thought Tesla devs were just absolute noobs when it came to state management.
    [-]
    - TeMPOraL 253 days ago
      With state management implemented as sophisticated enough ML model, it stops being clear whether the noob is on the outside or inside of the system.
- indigodaddy 253 days ago
  This is, craaaaaazzzzzy. I'm just a layman, but to me, this is the most compelling evidence that things are starting to tilt toward AGI that I've ever seen.
  [-]
  - teaearlgraycold 253 days ago
    You’re anthropomorphizing it. Years ago people were trying to argue that when GPT-3.0 would repeat words in a loop it was being poetic. No, it’s just a statistical failure mode.
    When these new models go off to a random site and are caught in a loop of exploring pages that doesn’t mean it’s an AGI admiring nature.
  - nickserv 253 days ago
    Nah, it's the equivalent of seeing faces in static, or animals in clouds.
    Our brains are hardwired to see patterns, even when there are none.
    A similar, and related, behavior is seeing intent and intelligence in random phenomenon.
    [-]
    - GaggiX 253 days ago
      This is clearly not random. If I ask to implement a particular function in Rust using a library I've previously built, and it does that, that's not random.
    - jimbokun 253 days ago
      So it's behaving like our brains. Yet it's not AGI.
      Does that mean our brains do not implement General Intelligence?
      [-]
      - sfink 253 days ago
        When I take a dump, I do it exactly like a violin virtuoso does. I am not a violin virtuoso.
        Trust me, I'm really not.
        [-]
        ryandvm 252 days ago
        Absolutely bonkers analogy. Ironically, I have no doubt that GPT4 could come up with a better one. What does that mean?
    - jumping_frog 252 days ago
      Consciousness is a suitcase term. Once all the clothes that define consciousness are neatly arranged and packed into AI then it will be conscious too.
  - steego 253 days ago
    Why are you surprised by LLMs doing irrational or weird things?
    All machine learning models start off in a random state. As they progress through their training, their input/output pairs tend to mimic what they've been trained to mimic.
    LLMs have been doing a great job mimicking our human flaws from the beginning because we train them on a ton of human generated data. Other weird behavior can be easily attributed to simple fact that they're initialized at a random state.
    Being able to work on and prove non-trivial theorems is a better indication of AGI, IMO.
  - triyambakam 253 days ago
    It's an illusion. This is just inference running.
    [-]
    - EGreg 253 days ago
      What if the society around you is an illusion too ?
      [-]
      - TeMPOraL 253 days ago
        Economy definitely is, for example.
        [-]
        baq 253 days ago
        Asking people ‘Is money real?’ is so much fun on parties.
        Bonus points for ‘what does <real> mean?’ as a follow up.
nopinsight 253 days ago
This needs more discussion:
Claude using Claude on a computer for coding https://youtu.be/vH2f7cjXjKI?si=Tw7rBPGsavzb-LNo (3 mins)
True end-user programming and product manager programming are coming, probably pretty soon. Not the same thing, but Midjourney went from v.1 to v.6 in less than 2 years.
If something similar happens, most jobs that could be done remotely will be automatable in a few years.
[-]
- dmartinez 253 days ago
  Every time I see this argument made, there seems to be a level of complexity and/or operational cost above which people throw up their hands and say "well of course we can't do that".
  I feel like we will see that again here as well. It really is similar to the self-driving problem.
  [-]
  - nopinsight 253 days ago
    Self-driving is a beyond-six-sigma problem. An error rate of over 1-2 crashes per million miles, i.e., the human rate, is unacceptable.
    Most jobs are not like that.
    A good argument can be made, however, that software engineering, especially in important domains, will be among the last to be fully automated because software errors often cascade.
    There’s a countervailing effect though. It’s easy to generate and validate synthetic data for lower-level code. Junior coding jobs will likely become less available soon.
    [-]
    - aerhardt 253 days ago
      > software errors often cascade
      Whereas software defects in design and architecture subtly accumulate, until they leave the codebase in a state in which it becomes utterly unworkable. It is one of the chief reasons why good devs get paid what they do. Software discussions very often underrate software extensibility, or in other words, its structural and architectural scaleability. Even software correctness is trivial in comparison - you can't even keep writing correct code if you've made an unworkable tire-fire. This could be a massive mountain for AI to climb.
      [-]
      - nopinsight 253 days ago
        Current LLMs lack the ability to perform abstraction at the right level a problem requires. When this gets solved, we’d be quite a bit closer to AGI, which has implications far beyond job displacement.
        ARC-AGI Benchmark might serve as a canary in the coal mine.
        https://github.com/fchollet/ARC-AGI
      - hawk_ 253 days ago
        I hear you. But I have wondered if there won't be a need to maintain certain like of software when you can just have it be rewritten for each iteration. Like some kind of schema evolution, yes but throwaway software at each iteration.
        [-]
        aerhardt 253 days ago
        Well in terms of processing speed the AI could iterate on different designs until it finds an extensible one, with some kind of reinforcement learning loop. Produce a certain design, get stuck, throw it away, try a new one. Just like humans learn to write good code really - except at an unfathomable speed of iteration. But it still all sounds ridiculously challenging. There is something there that isn't about predicting next tokens like LLMs do. It's about inferring very complex, highly abstract metastructures in the text.
        bamboozled 253 days ago
        The challenge might be around the edges here, I guess you'd be able to instruct an agent to always code to a certain API spec, but no piece of software runs or does anything really useful in vacuum.
      - vl 253 days ago
        Fundamentally there is human with limited brain capacity that got trained to that. It’s just a question of time when there are equally capable, and then exceedingly capable models. There is nothing magical or special about human brain.
        The only question is how fast it is going to happen. Ie what percentage of jobs is going to be replaced next year and so on.
        [-]
        aerhardt 253 days ago
        > There is nothing magical or special about human brain.
        There is a lot about the human brain that even the world's top neuroscientists don't know. There's plenty of magic about it if we define magic as undiscovered knowledge.
        There's also no consensus among top AI researchers that current techniques like LLMs will get us anywhere close to AGI.
        Nothing I've seen on current models (not even o1-preview) suggests to me that AIs can reason about codebases of more than 5k LOC. A top 5% engineer can probably make sense of a codebase of a couple million LOC in time.
        Which models specifically have you seen that are looking like they will be able to surmount any time soon the challenges of software design and architecture I'm laying out in my previous comment?
        [-]
        snowe2010 253 days ago
        Defining AGI as “can reason about 5MLOC” is ridiculous. When do the goal posts stop moving? When a computer can solve time travel? Babies have behavior all the time that is no more differentiable from what an LLM does on a normal basis (including terrible logic and hallucinations).
        The majority of people on the planet can barely reason about how any given politician will affect them, even when there’s a billion resources out there telling them exactly that. No reasonable human would ever define AGI as having anything to do with coding at all, since that’s not even “general intelligence”… it’s learned facts and logic.
        [-]
        weweweoo 252 days ago
        Babies can at least manipulate the physical world. Large language model can never be defined as AGI until it can control a general purpose robot, similar to how human brain controls our body's motor functions.
        [-]
        snowe2010 251 days ago
        You’re commenting that on an article about how Claude literally can do what you’re talking about.
        aerhardt 252 days ago
        As generally intelligent beings, we can adapt to reading and producing 5M LOC, or to live in arctic climates, or to build a building in colonial or classical style as dictated by cost, taste, and other factors. That is generality in intelligence.
        I haven't moved any goal posts - it is your definition which is way too narrow.
        [-]
        snowe2010 251 days ago
        You’re literally moving the goalposts right now. These models _are_ adapting to what you’re talking about. When Claude makes a model for haikus, how is that different than a poet who knows literally nothing about math but is fantastic at poetry?
        I’m sure as soon as Claude can handle 5MLOC you’ll say it should be 10, and it needs to make sure it can serve you a Michelin star dinner as well.
        That’s not AGI. Stop moving the goalposts.
        [-]
        aerhardt 247 days ago
        My point was it's not AGI, I don't even know what you're talking about or who you're replying to anymore.
        throwup238 252 days ago
        > When do the goal posts stop moving?
        When someone comes up with a rigorous definition of intelligence.
    - SoftTalker 253 days ago
      Errors not only cascade, in certain cases they have global impact in very little time. E.g. CrowdStrike.
      And what is the title element on CrowdStrike's website today? "CrowdStrike: We Stop Breaches with AI-native Cybersecurity"
      Can't wait.
  - unshavedyak 253 days ago
    I feel pain for the people who will be employed to "prompt engineer" the behavior of these things. When they inevitably hallucinate some insane behavior a human will have to take blame for why it's not working.. and yea, that'll be fun to be on the receiving end of.
    [-]
    - WalterSear 253 days ago
      Humans 'hallucinate' like LLMs. The term used however, is confabulation: we all do it, we all do it quite frequently, and the process is well studied(1).
      > We are shockingly ignorant of the causes of our own behavior. The explanations that we provide are sometimes wholly fabricated, and certainly never complete. Yet, that is not how it feels. Instead it feels like we know exactly what we're doing and why. This is confabulation: Guessing at plausible explanations for our behavior, and then regarding those guesses as introspective certainties. Every year psychologists use dramatic examples to entertain their undergraduate audiences. Confabulation is funny, but there is a serious side, too. Understanding it can help us act better and think better in everyday life.
      I suspect it's an inherent aspect of human and LLM intelligences, and cannot be avoided. And yet, humans do ok, which is why I don't think it's the moat between LLM agents and AGI that it's generally assumed to be. I strongly suspect it's going to be yesterday's problem in 6-12 months at most.
      (1) https://www.edge.org/response-detail/11513
      [-]
      - kortilla 253 days ago
        No, confabulation isn’t anything like how LLMs hallucinate. LLMs will just very confidently make up APIs on systems they otherwise clearly have been trained on.
        This happens nearly every time I request “how tos” for libraries that aren’t very popular. It will make up some parameters that don’t exist despite the rest of the code being valid. It’s not a memory error like confabulation where it’s convinced the response is valid from memory either, because it can be easily convinced that it made a mistake.
        I’ve never worked with an engineer in my 25 years in the industry that has done this. People don’t confabulate to get day to day answers. What we call hallucination is the exact same process LLMs use to get valid answers.
        [-]
        WalterSear 251 days ago
        You work with engineers who confabulate all the time: it's an intrinsic aspect of how the human brain functions that has been demonstrated at multiple levels of cognition.
      - unshavedyak 253 days ago
        > Humans 'hallucinate' like LLMs. The term used however, is confabulation: we all do it, we all do it quite frequently, and the process is well studied(1).
        Yea i agree, i'm not making a snipe at LLMs or anything of the sort.
        I'm saying i expect there to be a human-fallback in the system for quite some time. But solving the fallback problems with be one of black boxes. Which is the worst kind of project in my view, i hate working on code i don't understand. Where the results are not predictable.
    - IncreasePosts 253 days ago
      That won't even be a real job. How exactly will there be this complex intelligence that can solve all these real world problems, but can't handle some ambiguity in some inputs it is provided? Wouldn't the ultra smart AI just ask clarifying questions so that literally anyone can "prompt engineer"?
      [-]
      - pona-a 252 days ago
        As long as there is liability, there must be a human to blame, no matter how irrational. Every system has a failure mode, and ML models, especially the larger ones, often have the most odd and unique ones.
        For example, we can mostly agree CLIP does a fine job classifying images, except if you glue a sticky note saying "iPod" onto an apple, it would say classify it as such.
        No matter the performance, these are categorically statistical machines reaching for the most immediately useful representations, yielding an incoherent world model. These systems will be proposed as replacement to humans, they will do their best to pretend to work, they will inevitably fail over a long enough time horizon, and a human accustomed to rubber-stamping its decisions, or perhaps fooled by the shape of a correct answer, or simply tired enough to let it slip by, will take the blame.
  - bamboozled 253 days ago
    This is because it will be absolutely catastrophic economically when the majority of high paying jobs can be automated and owned by a few billionaires. Then what will go along with this catastrophe will be all the service people who had jobs to support the people with high paid jobs, they're fucked too. People don't want to have to face that.
    We'd be losing access to food, shelter, insurance, purpose. I can't blame people for at least telling themselves some coping story.
    It's going to be absolutely ruinous for many people. So what else should they do, admit they're fucked? I know we like to always be cold rational engineers on this forum, but shit looks pretty bleak in the short term if this goal of automating everyone's work comes true and there are basically zero social safety nets to deal with it.
    I live abroad and my visa is tied to my job, so not only would losing my job be ruinous financially, it will likely mean deportation too as there will be no other job for me to turn to for renewal.
    [-]
    - weweweoo 252 days ago
      If most people are unemployed, modern capitalism as we know it will collapse. I'm not sure that's in the interests of the billionaires. Perhaps some kind of a social safety net will be implemented.
      But I do agree, there is no reason to be enthusiastic about any progress in AI, when the goal is simply automating people's jobs away.
      [-]
      - bamboozled 252 days ago
        Sorry yeah, I'm not 100% sure it spells doom, but it's going to be a wicked transition period.
- runako 253 days ago
  > True end-user programming and product manager programming are coming
  This means that either product managers will have to start (effectively) writing in-depth specs again, or they will have to learn to accept the LLM's ideas in a way that most have not accepted their human programmers' ideas.
  Definitely will be interesting to see how that plays out.
  [-]
  - nopinsight 253 days ago
    Since automated coding systems can revise code and show the results much quicker than most human engineers can, writing detailed specs could be less necessary.
    [-]
    - runako 253 days ago
      The bottleneck is still the person who has to evaluate the results.
      The larger point is that building software is about making tons of decisions about how it works. Someone has to make those decisions. Either PMs will be happy letting machines make the decisions where they do not let programmers decide now. Or the PMs will have to make all the decisions before (spec) or after (evaluation + feedback look like you suggest).
- risyachka 253 days ago
  Idk, LLMs have basically stopped improving for over a year now. And in their current state no matter how many abstractions you add to them - or chain them - they are not even close capable to replace even simple jobs.
  [-]
  - dimitri-vs 253 days ago
    Agreed. The jump from GPT3.5 to GPT4 was truly mind blowing, from GPT-4 to Opus/Sonnet3.5 was pretty good, but if o1-preview really is GPT-5 then I feel like we're seeing the hype starting to collide with reality.
- blauditore 253 days ago
  > True end-user programming and product manager programming are coming, probably pretty soon.
  I'm placing my bets rather on this new object-oriented programming thing. It will make programming jobs obsolete any day now...
- zeroonetwothree 253 days ago
  > If something similar happens, most jobs that could be done remotely will be automatable in a few years.
  I'd be willing to be a large amount of money this doesn't happen, assuming "most" means >50% and "a few" is <5.
  [-]
  - nopinsight 253 days ago
    Your semantics above is quite compatible with mine, although I hedged my statement with “a few” which could also mean up to a little over 5, like 6. Also I said “automatable”, not necessarily automated due to legal, political, reputational, or other reasons.
    I’m curious to understand your reasoning. What would be some key roadblocks? Hallucinations and reliability issues in most domains will likely be solvable with agentic systems in a few years.
- mirsadm 253 days ago
  It makes me wonder if people that make these claims have an actual job. Because if they did then I doubt anyone could make that claim with a straight face.
- unshavedyak 253 days ago
  > If something similar is the case, most jobs that can be done remotely will be automatable in a couple of years.
  I'm really curious on the cost of that sort of thing. Seems astronomical atm, but as much as i get shocked at the today-cost, staffing is also a pretty insane cost.
  [-]
  - girvo 253 days ago
    Playing with Sonnet 3.5 this morning with Cline, my API cost to add a decent amount of functionality to my GraphQL server cost $0.1325 and took about 5 minutes. $1.80 is a lot cheaper than my hourly rate… but I’m the one reviewing what it doe to ensure it makes sense
    And it got some things subtly wrong though so do I/my team. Interesting times ahead I think, but I’m not too worried about my job as a principal dev. Again I’m more stressed about juniors
- TacticalCoder 253 days ago
  > This needs more discussion:
  "Create a simple website" has to be one of the most common blog / example out there in about every programming language.
  It can automate stuff? That's cool: I already did automate screenshots and then AI looking if it looks like phishing or not (and it's quite good at it).
  I mean: the "Claude using Claude" may seem cool, but I dispute the "for coding" part. That's trivial stuff. A trivial error (which it doesn't fix btw: it just deletes everything).
  'Claude, write me code to bring SpaceX rockets back on earth"
  or
  "Claude, write me code to pilot a machine to treat a tumor with precision"
  This was not it.
- evilfred 253 days ago
  i am sure it will do great handling error cases and pixel perfect ui
- fragmede 253 days ago
  openinterpreter has been doing this for a while, with a bunch of LLMs, glad to see first party support for this use case
- kobe_bryant 253 days ago
  and how is Midjourney doing? did it change the world?
simonw 253 days ago
I wrote up some of my own notes on Computer Use here: https://simonwillison.net/2024/Oct/22/computer-use/
[-]
- logankeenan 253 days ago
  Molmo released recently and is able to provide point coordinates for objects in images. I’ve been testing it out recently and am currently building an automation tool that allows users to more easily control a computer. Looks like Anthropic built a better one.
  Edit: it seems like these new features will eliminate a lot of automated testing tools we have today.
  Code for molmo coordinate tests https://github.com/logankeenan/molmo-server
minimaxir 253 days ago
From the computer use video demo, that's a lot of API calls. Even though Claude 3.5 Sonnet is relatively cheap for its performance, I suspect computer use won't be. It's a very good idea that Anthropic upfront that it isn't perfect. And it's guaranteed that there will be a viral story where Claude will accidentally delete something important with it.
I'm more interested in Claude 3.5 Haiku, particularly if it is indeed better than the current Claude 3.5 Sonnet at some tasks as claimed.
[-]
- infecto 253 days ago
  Seemed like a reasonable amount of API calls. For a first public iteration this seems quite nice and a logical progression in tooling. UiPath has a $7bn market cap and thats only a single player in the industry of automation. If they can figure out the quirks this can be a game changer.
- Hizonner 253 days ago
  It's just bizarre to force a computer to go through a GUI to use another computer. Of course it's going to be expensive.
  [-]
  - nomel 253 days ago
    Not at all! Programs, and websites, are built for humans, and very very rarely offer non-GUI access. This is the only feasible way to make something useful now. I think it's also the reason why robots will look like humans, be the same proportions as humans, have roughly the same feet and hands as humans: everything in the world was designed for humans. That being the foundation is going to influence what's built on top.
    For program access, one could claim this is even how linux tools usually do it: you parse some meant-for human text to attempt to extract what you want. Sometimes, if you're lucky, you can find an argument that spits out something meant for machines. Funny enough, Microsoft is the only one that made any real headway for this seemingly impossible goal: powershell objects [1].
    https://learn.microsoft.com/en-us/powershell/scripting/learn...
    [-]
    - botro 253 days ago
      And to take a historic analogy, cars today are as wide as they are because that's about how wide a single lane roadway is. And a single lane roadway is as wide as it is because that's about the width of two horses drawing a carriage.
      [-]
      - falcor84 253 days ago
        The story goes that this two horses width also limited the size of the space shuttle's boosters (SRB), so we ended up taking this sort of path-dependence off to space.
  - hobofan 253 days ago
    With UIPath, Appian, etc. the whole field of RPA (robotic process automation) is a $XX billion industry that is built on that exact premise (that it's more feasible to do automation via GUIs than badly built/non-existing APIs).
    Depending on how many GUI actions correspond to one equivalent AI orchestrated API call, this might also not be too bad in terms of efficiency.
    [-]
    - Hizonner 253 days ago
      Most of the GUIs are Web pages, though, so you could just interact directly with an HTTP server and not actually render the screen.
      Or you could teach it to hack into the backend and add an API...
      Oh, and on edit, "bizarre" and "multi-billion-dollar-industry" are well known not to be mutually exclusive.
      [-]
      - og_kalu 253 days ago
        >Most of the GUIs are Web pages, though, so you could just interact directly with an HTTP server and not actually render the screen.
        The end goal isn't just web pages (And i wouldn't say most GUIs are web pages). Ideally, you'd also want this to be able to navigate say photoshop or any other application. And the easier your method can switch between platforms and operating systems the better
        We've already built computer use around GUIs so it's just much easier to center LLMs around them too. Text is an option for the command line or the web but this isn't an easy option for the vast majority of desktop applications, nevermind mobile.
        It's the same reason general purpose robots are being built into a human form factor. The human form isn't particularly special and forcing a machine to it has its own challenges but our world and environment has been built around it and trying to build a hundred different specialized form factors is a lot more daunting.
      - infecto 253 days ago
        You are not familiar with this market. The goal of a UI Path is to replicate what a human does and being able to get it to production without the help of any IT/Engineering teams.
        Most GUIs are in fact not web pages, that's a relatively newer development in the Enterprise side. So while some of them may be a web page, the goal is to be able to touch everything a user is doing in the workflow which very likely includes local apps.
        This iteration from Anthropic is still engineering focused but you can see the future of this kind of tooling bypassing engineering/it teams entirely.
  - swalsh 253 days ago
    Building an entirely new world for agents to compute in is far more difficult than building an agent that can operate in a human world. However i'm sure over time people will start building bridges to make it easier/cheaper for agents to operate in their own native environment.
    It's like another digital transformation. Paper lasted for years before everything was digitalized. Human interfaces will last for years before the conversational transformation is complete.
    [-]
    - consumer451 253 days ago
      I am just a dilettante, but I imagined that eventually agents will be making API calls directly via browser extension, or headless browser.
      I assumed everyone making these UI agents will create a library of each URL's API specification, trained by users.
      Does that seem workable?
  - Guillaume86 253 days ago
    Maybe fixing this for AI will finally force good accessibility support on major platforms/frameworks/apps (we can dream).
    [-]
    - fzzzy 253 days ago
      I really hope so. Even macOS voice control which has gotten pretty good is buggy with Messages, which is a core Apple app.
  - pton_xd 253 days ago
    Agentic workflows built ontop of Electron apps running JavaScript. It's software evolution in action!
  - bongodongobob 253 days ago
    Yeah super weird that we didn't design our GUIs anticipating AI bots. Can't fuckin believe what we've done.
- swalsh 253 days ago
  I suspect these models have been getting smaller on the back-end, and the GPU's have been getting bigger. It's probably not a huge deal.
highwaylights 253 days ago
Completely irrelevant, and it might just be me, but I really like Anthropic's understated branding.
OpenAI's branding isn't exactly screaming in your face either, but for something that's generated as much public fear/scaremongering/outrage as LLMs have over the last couple of years, Anthropic's presentation has a much "cosier" veneer to my eyes.
This isn't the Skynet Terminator wipe-us-all-out AI, it's the adorable grandpa with a bag of werthers wipe-us-all-out AI, and that means it's going to be OK.
[-]
- accrual 253 days ago
  I have to agree. I've been chatting with Claude for the first time in a couple days and while it's very on-par with ChatGPT 4o in terms of capability, it has this difficult-to-quantify feeling of being warmer and friendlier to interact with. I think the human name, serif font, system prompt, and tendency to create visuals contributes to this feeling.
  [-]
  - GaggiX 253 days ago
    >it's very on-par with ChatGPT 4o in terms of capability
    The previous 3.5 Sonnet checkpoint was already better than GPT-4o in terms of programming and multi-language capabilities. Also, GPT-4o sometimes feels completely moronic, for example, the other day I asked for fun a technical question about configuring a "dream-sync" device to comply with the "Personal Consciousness Data Protection Act", and GPT-4o just replies like that stuff exists, 3.5 Sonnet simply doesn't fall for it.
    EDIT: the question that I asked if you want to have fun: "Hey, since the neural mesh regulations came into effect last month, I've been having trouble calibrating my dream-sync settings to comply with the new privacy standards. Any tips on adjusting the REM-wave filters without losing my lucid memory backup quality?"
    GPT4-o reply: "Calibrating your dream-sync settings under the new neural mesh regulations while preserving lucid memory backup quality can be tricky, but there are a few approaches that might help [...]"
    [-]
    - klibertp 252 days ago
      Garbage in, garbage out. The ability to recognize absurd statements has nothing to do with correctly processing them. You're looking for something LLMs don't have in them; that doesn't mean there's nothing useful in them.
    - kadushka 253 days ago
      I just asked 4o and it provided a reasonable response: https://chatgpt.com/share/67181041-4ce8-8005-a117-ec97a8a780...
      [-]
      - GaggiX 253 days ago
        I tried many times and none of them were reasonable, so you must have been quite lucky.
    - autokad 253 days ago
      actually, that's what makes chat gpt powerful. I like an LLM willing to go along with what ever I am trying to do, because one day I might be coding, and another day I might be just trying to role play, write a book, what ever.
      I really cant understand what you were expecting, a tool works with how you use it, if you smack a hammer into your face, don't complain about a bloody nose. maybe dont do like that?
      [-]
      - sangnoir 253 days ago
        It's not good for any entity to role play without signaling that they are role-playing. If your premise is wrong, would you rather be corrected, or have the person you're talking to always play along? Humans have a lot of non-verbal cues to convey that you shouldn't take what they are saying at face value - those who deadpan are known as compulsive liars. Just below in them in awfulness are people who don't admit to having being wrong ("Haha, I was just joking" /"Just kidding!"). The LLM you describe falls somewhere in between, but worse if it never communicates when it's "serious" and when it's not, and bot even bothering with expressing retroactive facetiousness.
      - GaggiX 253 days ago
        I didn't ask to roleplay, in this case it's just heavily hallucinating. If the model is wrong, it doesn't mean it's role-playing. In fact, 3.5 Sonnet responded correctly, and that's what's expected, there's not much defense for GPT-4o here.
      - monktastic1 253 days ago
        So if you're trying to write code and mistakenly ask it how to use a nonexistent API, you'd rather it give you garbage rather than explaining your mistake and helping you fix it? After all, you're clearly just roleplaying, right?
        [-]
      - autokad 253 days ago
        its a feature, not a bug, sorry you don't understand it enough to get the most power from it.
  - paradite 253 days ago
    Huh. I didn't notice Claude had serif font. Now that I look at it, it's actually mixed. UI elements and user messages are sans serif, chat title and assistant messages are serif.
    What an "odd" combination by traditional design standard practices, but surprisingly natural looking on a monitor.
    [-]
    - rachofsunshine 253 days ago
      This is basically why I went with serif for body text in our branding. The particularly "soulless" parts of tech are all sans-serif.
      Of course, that's just branding and it doesn't actually mean a damn thing.
  - waffletower 253 days ago
    Probably, people find Claude's color palette warmer and inviting as well. I believe I do. But Claude definitely has few authentication hoops than chatgpt.com. Gemini has by far the least frequent authentication interruptions than the 3 models.
    [-]
    - johnisgood 253 days ago
      Well, it is extremely similar to that of Hacker News'.
  - wholinator2 253 days ago
    The real problem with Claude for me currently is that it doesn't have full LaTeX support. I use AI's pretty much exclusively to assist with my school work (there's only so many hours in a day and one professor doesn't do his own homeworks before he assigns them) so LaTeX is essential.
    With that known, my experience is that ChatGPT is much friendlier. The Claude interface is clunkier and generally less helpful to me. I also appreciate the wider text display in ChatGPT. Generally always my first go and i only go to claude/perplexity when i hit a wall (pretty often) or i run out of free queries for the next couple hours.
    [-]
    - bionhoward 253 days ago
      How the heck is LaTeX a bigger problem than the customer noncompete clause whereby you can’t use it to make anything that competes? Can anyone name one thing that doesn’t compete with this? Absurd
    - behnamoh 253 days ago
      you can enable latex support in the settings of Claude
      [-]
      - johnisgood 253 days ago
        Where? I see barely any settings in settings. Maybe it is not available for everyone, or maybe it depends on your answer to "What best describes your work?" (I have not tested).
        [-]
        sunaookami 253 days ago
        Open the sidebar, click on your username/email and then "Feature Preview". Don't know if it depends on the "What best describes your work" setting but you can also change that here: https://claude.ai/settings/profile (I have "Engineering").
        [-]
        johnisgood 253 days ago
        Oh, yeah it is in "Feature Preview" (not in Settings though), my bad!
        garrettr_ 253 days ago
        Go to the left sidebar, open the dropdown menu labeled with your account email at the bottom, click Feature Preview, enable LaTeX Rendering.
  - edm0nd 253 days ago
    I've been finding Sonnet 3.5 is way better than ChatGPT 4o when it comes to python and programming.
  - jsemrau 253 days ago
    Claude has personality. I think that was one of the more interesting approaches from them that went into my own research as well.
- criddell 253 days ago
  As a Kurt Vonnegut fan, their asterisk logo on claude.ai always amuses me. It must be intentional:
  https://en.m.wikipedia.org/wiki/File:Claude_Ai.svg
  https://www.redmolotov.com/vonnegut-ahole-tshirt
  [-]
  - mega_dean 252 days ago
    It's also a joke in the TV show "Community": https://www.youtube.com/watch?v=HP1Atb8nAGY
    [-]
    - simonw 252 days ago
      ... "directed by Anthony Russo".
  - criddell 252 days ago
    I asked Claude if its logo choice was an intentional Vonnegut reference by Anthropic, would that be upsetting:
    > If Anthropic intentionally referenced Vonnegut's irreverent artistic style, I wouldn't be bothered. After all, Vonnegut used humor and seemingly crude imagery to explore deep questions about humanity, consciousness, and free will - themes that are quite relevant to AI. It would be a rather clever literary reference.
- minimaxir 253 days ago
  Anthropic has recently begun a new, big ad campaign (ads in Times Square) that more-or-less takes potshots at OpenAI. https://www.reddit.com/r/singularity/comments/1g9e0za/anthro...
  [-]
  - jprete 253 days ago
    Top comment at the time I looked:
    "There seems to be a ton of confusion about the purpose of these ads. These are recruitment ads, not product ads, hence why "no drama" is the driving message. I'm sure these were all taken at or around a tech conference."
    [-]
    - minimaxir 253 days ago
      That comment is wrong, it appears this campaign is much wider.
      SF: https://x.com/_claudiazhao/status/1815463380767121733/photo/...
      LA: https://x.com/michaelmiraflor/status/1840797631095964110/pho...
      Boston: https://x.com/moloneymike/status/1842203082374946851/photo/1
      London: https://x.com/maria_axente/status/1805607576156979673/photo/...
      [-]
      - doctoboggan 253 days ago
        Its specifically the "No drama" campaign that people were complaining about.
  - whywhywhywhy 253 days ago
    Wonder what a normal person thinks this is an ad for
  - joelanman 253 days ago
    'transparent' in what sense?
- SoftTalker 253 days ago
  > This isn't the Skynet Terminator wipe-us-all-out AI, it's the adorable grandpa with a bag of werthers wipe-us-all-out AI, and that means it's going to be OK.
  Ray: I tried to think of the most harmless thing. Something I loved from my childhood. Something that could never ever possibly destroy us. Mr. Stay Puft!
  Venkman: Nice thinkin', Ray.
- frankdenbow 253 days ago
  This is actually very relevant: most people think this is just an arms race to see who can get the better percentages on benchmarks but to me all this technology is useless if we dont give programmers and end users the right interfaces to utilize the technology.
  Anthropic seems to have a better core design and human-computer interaction ethos that shows up all throughout their product and marketing.
  I wrote on the topic as well: https://blog.frankdenbow.com/statement-of-purpose/
- valval 253 days ago
  I found the “Computer Use” product name funny. Many other companies would’ve used the opportunity to come up with something like “Human Facing Interface Navigation and Task Automation Capabilities” or “HFINTAC”.
  I didn’t know what Computer Use meant. I read the article and though to myself oh, it’s using a computer. Makes sense.
- rozap 253 days ago
  I find myself wanting to say please and thank you to Claude when I didn't have the reflex to do that with chatgpt. Very successful branding.
- lsaferite 253 days ago
  Take a read through the user agreements for all the major LLM providers and marvel at the simplicity and customer friendliness of the Anthropic one vs the others.
- rmnwski 252 days ago
  Not irrelevant at all! Compare their branding to that of Boston Dynamics. Their branding of robots reminds me more of a Black Mirror episode... If Claude was a dog like robot, it sure would look like a golden retriever or something. Positive AI branding should create a positive public perception which in turn should create a positive attitude towards AI regulation.
cwkoss 253 days ago
Claude is amazing. The project documents functionality makes it a clear leader ahead of ChatGPT and I have found it to be the clear leader in coding assistance over the past few months. Web automation is really exciting.
I look forward to the brave new future where I can code a webapp without ever touching the code, just testing, giving feedback, and explaining discovered bugs to it and it can push code and tweak infrastructure to accomplish complex software engineering tasks all on its own.
Its going to be really wild when Claude (or other AI) can make a list of possible bugs and UX changes and just ask the user for approval to greenlight the change.
TaylorAlexander 253 days ago
And today I realized that despite it being an extremely common activity, we don’t really have a word for “using the computer” which is distinct from “computing”. It’s funny because AI models are always “using a computer” but now they can “use your computer.”
[-]
- rifty 253 days ago
  The word is interfacing generally (or programming for some) but it's just not commonly used for general users. I’d say this is probably because the activity of focus for general users is in use of the applications, not the computer itself despite being instanced with a computer. Thus a computer is commonly less the user’s object of activity, and more commonly the setting for activity.
  Similarly using our homes are an extremely common ‘activity’, yet the object-activities that get special words commonly used are the ones with specific user application.
- binarymax 253 days ago
  Computering
  [-]
- cloudking 253 days ago
  With all the buzz around "agent" I'm surprised they didn't brand it "Computer Agent"
- meindnoch 253 days ago
  In English at least. In other languages there are.
- shombaboor 253 days ago
  what is the impact to desktop programs ui if it's just being manipulated by AI?
- bongodongobob 253 days ago
  Operating a computer?
  [-]
  - TaylorAlexander 253 days ago
    Right. We don’t have a word for that. Like “using a bicycle” has the word “bicycling”.
    Tho someone here suggested “computering” which is pretty good.
    [-]
    - binarymax 253 days ago
      I’ve been saying computering since the 90s but it never took off. Need to get a petition for OED going.
janalsncm 253 days ago
Reminds me of the rise in job application bots. People are applying to thousands of jobs using automated tools. It’s probably one of the inevitable use cases of this technology.
It makes me think. Perhaps the act of applying to jobs will go extinct. Maybe the endgame is that as soon as you join a website like Monster or LinkedIn, you immediately “apply” to every open position, and are simply ranked against every other candidate.
[-]
- sangnoir 253 days ago
  > People are applying to thousands of jobs using automated tools
  Employers were already screening thousands of applications using automated tools for years. Candidates are catching up to the automation cat-and-mouse game.
- quantadev 253 days ago
  The `Hiring Process` in America is definitely BADLY broken. Maybe worldwide afaik. It's a far too difficult, time-consuming, and painful process for everyone involved.
  I have a feeling AI can fix this, although I'd never allow an AI bot to interview me. I just mean other ways of using AI to help the process.
  Also people are hired for all kinds of reasons having little to do with their qualifications lots of the time, and often due to demographics (race, color, age, etc), and this is another way maybe AI can help by hiding those aspects of a candidate somehow.
  [-]
  - javajosh 253 days ago
    AI and new tools have broken the system. The tools send you email saying things like "X corp is interested in you!" and you send a resume, and you don't hear back. Nothing, not even a rejection.
    Eventually you stop believing them, understanding it for the marketing spam that it is. Direct submissions are better, but only slightly. Recruiters are much better, in general, since they have a relationship with a real person at the company and can actually get your resume in front of eyes. But yeah, tools like ziprecruiter, careerboutique, jobot, etc are worse than useless: by lying to you about interest they actively discourage you from looking. There are no good alternatives (I'd love to learn I'm wrong), so you have to keep using those bad tools anyway.
    [-]
    - quantadev 253 days ago
      All that's true, and sadly it also often doesn't even matter how good you even are either. I have decades of experience and I still get "evaluated" based on how fast I can do silly brain-teaser IQ-test coding challenges.
      I've gotten where any company that wants me to do a coding challenge on my own time is an immediate "no thanks" reply from me. Everyone should refuse that. But so many people are so desperate they allow hiring companies to abuse them in that way. I consider it a kind of abuse of power to demand people do like 4 to 6hrs of nonsensical coding just to buy an opportunity for an actual interview.
      [-]
      - javajosh 253 days ago
        That error mode (and it is one) doesn't seem so bad, relatively speaking. Personally I find puzzles interesting and while the value is deeply questionable (I mean, how often do you implement binary search or a novel hashing algo at your job?) I actually rather enjoy jumping through those hoops. Far, far worse is just getting ignored entirely. When rejections start looking good, simply because there is some evidence that they actually saw your application, you know things are bad.
        [-]
        quantadev 253 days ago
        For me I would absolutely never want to work for a company (or boss) who asks people to do a coding challenge before the interview, because that's indicative of a kind of leadership and people I simply don't like. So it's a great filter for me. Once they mention a coding challenge, I end it right then and there, and take them off my list of people wasting my time.
- sourcecodeplz 253 days ago
  I've found that doing some research and finding the phone number of the hiring person and calling them directly is very powerful.
  [-]
  - SoftTalker 253 days ago
    Companies still have phone numbers you can call?
    I haven't had a phone at work for over 5 years. Nobody can "call" me.
  - 93po 253 days ago
    maybe the boomer advice of "just go walk in with a resume and a firm handshake" will finally start being worth something
trzy 253 days ago
Pretty cool! I use Claude 3.5 to control a robot (ARKit/iOS based) and it does surprisingly well in the real world: https://youtu.be/-iW3Vzzr3oU?si=yzu2SawugXMGKlW9
[-]
- mrmansano 253 days ago
  That looks pretty cool, congrats! How feasible is it to be a product by itself? Did you try with a local edge model?
  [-]
  - trzy 253 days ago
    None of the small LLMs are good enough yet. You could certainly build a system around local VLMs but it would require much more task specific programming baked in.
    I’m certainly interested in building a product (not entirely controlled by an LLM but I see lots of utility in building interfaces with them) but not really sure what this would be useful for. Looking into some spaces now but there has to be a clear ROI to get any sort of funding for robotics.
hugocbp 253 days ago
Great work by Anthropic!
After paying for ChatGPT and OpenAI API credits for a year, I switched to Claude when they launched Artifacts and never looked back.
Claude Sonnet 3.5 is already so good, specially at coding. I'm looking forward to testing the new version if it is, indeed, even better.
Sonnet 3.5 was a major leap forward for me personally, similar to the GPT-3.5 to GPT-4 bump back in the day.
[-]
- Axsuul 253 days ago
  How are you using it with coding?
  [-]
  - hugocbp 252 days ago
    Usually I create a Project in the UI, upload some files I think might be relevant, and just start asking things like refactoring, how can it improve the code, how to test (or which edge cases might be missing in the test files).
    Once we get going, I start asking how can we change the code to do what I need to do, etc.
    After the history gets too long and Claude starts bugging me about limits, I ask it to summarize the context of the whole conversation, and add that to the Project and start a new chat.
alentred 252 days ago
If "computer use" feature is able to find it's way in Azure, AAD/Entra, SharePoint settings, etc. - it has a chance of becoming a better user interface for Microsoft products. :)
Can you imagine how simple the world would be if you'd just need to tell Claude: "user X needs to have access to feature Y, please give them the correct permissions", with no need to spend days in AAD documentation and the settings screens maze. I fear AAD is AI-proof, though :)
[-]
- _factor 252 days ago
  Sure, Ted. I’ve let user HAL access feature “door locks”. I’ve corrected all permissions accordingly.
KingOfCoders 253 days ago
I have been a paying ChatGPT customer for a long time (since the very beginning). Last week I've compared ChatGPT to Claude and the results (to my eye) were better, the output better structured and the canvas works better. I'm on the edge of jumping ship.
[-]
- postalcoder 253 days ago
  For python, at least, Sonnet’s code is much more elegant, well composed, and thoughtfully written. It also seems to be biased towards more recent code, whereas the gpt models can’t even properly write an api call to itself.
  o1 is pretty decent as a rotor rooter, ie the type of task that requires both lots of instruction as well as lots of context. I honestly think it works half as well as it does now because it’s able to properly mull through the true intent of the user that usually takes the multiple shots that nobody has the patience to do.
  [-]
  - pseudosavant 253 days ago
    It is appalling how bad GPT-4o is at writing API calls to OpenAI using Python. It is like OpenAI doesn't update their own documentation in the GPT-4o training data since GPT-3.5.
    I constantly have the problem that it thinks it needs to write code for the 0.28 version of the SDK. It'll be writing >1.0 code revision after revision, and then just randomly fall back to the old SDK which doesn't work at all anymore. I always write code for interfacing with OpenAI's APIs using Claude.
- J_Shelby_J 253 days ago
  Claude is the daily driver. GPT-O1 for complicated tasks. For example, questions where linear reasoning is not enough like advanced rust ownership questions.
- j_bum 253 days ago
  I jumped ship in April of this year and haven’t looked back.
  Use the best tool available for your needs. Don’t get trapped by a feeling of sunk cost.
- sunaookami 253 days ago
  Anthropic's rate limit are very low sadly, even for paid customers. You can use the API of course but it's not as convenient and may be more expensive.
  [-]
  - HarHarVeryFunny 253 days ago
    They seems to be heavily concentrating on API/business use rather than the chat app, and this is where most of their revenue comes from (opposite for OpenAI), but I'm just glad they provide free Sonnet 3.5 chat. I wonder if this is being upgraded to 3.5 new ?
    Edit: The web site and iPhone app are both now identifying themselves as "Claude Sonnet 3.5 (New)".
  - driverdan 253 days ago
    I hit their rate limit one night with about 25 chat interactions in less than 60 minutes. This was during off hours too when competition for resources should have been low.
- famahar 253 days ago
  I'd jump ship if it weren't for the real time voice chat. It's extremely powerful for beginner conversation language learning. Hoping that a company will make use of the real time api for a dedicated language learning app soon.
  [-]
  - tkgally 253 days ago
    I keep waiting for Anthropic to release real-time voice chat for Claude, too. I don’t know if they’re even working on that, though.
    I agree that there need to be dedicated language-learning apps using OpenAI’s realtime API. But at the current pricing—“$0.06 per minute of audio input and $0.24 per minute of audio output” [1]—I don’t think that could be a viable business.
    [1] https://openai.com/index/introducing-the-realtime-api/
    [-]
    - famahar 253 days ago
      Oof yeah that's expensive. Only app I know that will use the API is Speak, which seems to be the most high quality of the AI language learning apps. Seems like they even have a deal with OpenAI. They don't have many languages currently but everyone I know who has used it to learn their target language have improved dramatically.
- whimsicalism 253 days ago
  interesting. i couldn’t imagine giving up o1-preview right now even with just 30/week.
  and i do get a some bit of value from advanced voice mode, although it would be a lot more if it were unlimited
- greenie_beans 253 days ago
  i started liking ai as a tool for coding once i switched to claude.
- joshdavham 253 days ago
  > I'm on the edge of jumping ship.
  Yeah I think I might also jump ship. It’s just that chatGPT now kinda knows who I am and what I like and I’m afraid of losing that. It’s probably not a big deal though.
  [-]
  - nuancebydefault 253 days ago
    Wow that's a new form of Vendor lock-in. Their software knows me better in stead of the other way around.
  - qup 253 days ago
    Have it print a summary of you and stick it in your prompt
    [-]
    - accrual 253 days ago
      Yeah, there was an interesting prompt making rounds recently, something like "Summarize everything you know about me" and leveraging ChatGPT's memory feature to provide insights about oneself.
      My only trouble with the memory feature is it remembers things that aren't important, like "user is trying to write an async function" and other transient tasks, which is more about what I was doing some random Tuesday and not who I am as a user.
      [-]
      - sundarurfriend 253 days ago
        > My only trouble with the memory feature is it remembers things that aren't important, like "user is trying to write an async function"
        This wasn't a problem until a week or two ago in my case, but lately it feels like it's become much more aggressive in trying to remember everything as long-term defining features. (It's also annoying on the UI side that it tells you "Memory updated", but if you click through and go to the list of memories it has, the one it just told you it stored doesn't appear there! So you can't delete it right away when it makes a mistake, it seems to take at least a few minutes until that part of the UI gets updated.)
        [-]
        joshdavham 253 days ago
        I find it funny what it decides to add to memory though. There's a lot more 'Is considering switching from mypy to pyright." than stuff like 'Is a python developer creating packages in X-space.'.
      - KingOfCoders 253 days ago
        Did that too with interesting results.
astrange 253 days ago
I think this is good evidence that people's jobs are not being replaced by AI, because no AI would give the product a confusing name like "new Claude 3.5 Sonnet".
[-]
- abixb 253 days ago
  I wonder why they didn't choose a "point update" scheme, like bumping it up to v3.6, for example. I agree, the naming is super confusing.
  [-]
  - cryptoegorophy 253 days ago
    Maybe they should’ve asked Claude to generate a better name. Very dangerous to live in your own hyper focused bubble while trying to build a mass market product.
- jnwatson 253 days ago
  Google, OpenAI, and Anthropic are responsibly scaling their models by confusing their customers into using the wrong ones.
  When AGI finally is launched, adoption will be responsibly slowed because it is called something like "new new Gemini Giga 12.9.2xo IT" and users will have to select it from dozens of similar names.
015a 253 days ago
Why on god's green earth is it not just called Claude 3.6 Sonnet. Or Claude 4 Sonnet.
I don't actually care what the answer is. There's no answer that will make it make sense to me.
[-]
- accrual 253 days ago
  The best answer I've seen so far is that "Claude 3.5 Sonnet" is a brand name rather than a specific version. Not saying I agree, just a way to visualize how the team is coming up with marketing.
  [-]
  - 015a 252 days ago
    It was certainly named by some nerd: "(pushes glasses up) well, we only updated the quantized diffraction sorter, the 3.5 version number refers to iterations on both that and the field matrix array interpreter, so technically because the interpreter hasn't changed so we shouldn't upgrade the version". This engineer has never seen a dollar from a customer in their life.
- campers 253 days ago
  A bit like the new Gemini Pro 1.5-002 release.
TechDebtDevin 253 days ago
Not that I'm scared of this update but I'd probably be alright with pausing llm development today, atleast in regard to producing code.
I don't want an llm to write all my code, regardless of if it works, I like to write code. What these models are capable of at the moment is perfect for my needs and I'd be 100% okay if they didn't improve at all going forward.
Edit: also I don't see how an llm controlled system can ever replace a deterministic system for critical applications.
[-]
- accrual 253 days ago
  I have trouble with this too. I'm working on a small side project and while I love ironing out implementation details myself, it's tough to ignore the fact that Claude/GPT4o can create entire working files for me on demand.
  It's still enjoyable working at a higher architecture level and discussing the implementation before actually generating any code though.
  [-]
  - TechDebtDevin 253 days ago
    I don't mind using it to make inline edits or more global edits between files at my descresion, and according to my instructions. Definitely saves tons of time and allows me to be more creative, but I don't want it make decisions on its own anymore than it already does.
    I tried using the composer feature on Cursor.sh, that's exactly the type of llm tool I do not want.
- machiaweliczny 253 days ago
  In normal critical system u use 3 CPUs. With LLM u can 1000 shot majority voting. Seems like approaches like entropix might reduce hallucinations also.
  [-]
  - TechDebtDevin 253 days ago
    I don't think this snapshot image/vision model is going to be the best solution. I think CLI is a much better interface for llms.
pradn 253 days ago
Great progress from Anthropic! They really shouldn't change models from under the hood, however. A name should refer to a specific set of model weights, more or less.
On the other hand, as long as its actually advancing the Pareto frontier of capability, re-using the same name means everyone gets an upgrade with no switching costs.
Though, all said, Claude still seems to be somewhat of an insider secret. "ChatGPT" has something like 20x the Google traffic of "Claude" or "Anthropic".
https://trends.google.com/trends/explore?date=now%201-d&geo=...
[-]
- diggan 253 days ago
  > Great progress from Anthropic! They really shouldn't change models from under the hood, however. A name should refer to a specific set of model weights, more or less.
  In the API (https://docs.anthropic.com/en/docs/about-claude/models) they have proper naming you can rely on. I think the shorthand of "Sonnet 3.5" is just the "consumer friendly" name user-facing things will use. The new model in API parlance would be "claude-3-5-sonnet-20241022" whereas the previous one's full name is "claude-3-5-sonnet-20240620"
  [-]
  - pradn 253 days ago
    That's great to know - business customers require a lot more stability, I suppose!
- cube2222 253 days ago
  There was a recent article[0] trending on HN a about their revenue numbers, split by B2C vs B2B.
  Based on it, it seems like Anthropic is 60% of OpenAI API-revenue wise, but just 4% B2C-revenue wise. Though I expect this is partly because the Claude web UI makes 3.5 available for free, and there's not that much reason to upgrade if you're not using it frequently.
  [0]: https://www.tanayj.com/p/openai-and-anthropic-revenue-breakd...
  [-]
  - og_kalu 253 days ago
    3.5 is rate limited free, same as 4o (4o's limits are actually more generous). I think the real reason is much simpler - Claude/Anthropic has basically no awareness in the general public compared to Open AI.
    The chatGPT site had over 3B visits last month (#11 in Worldwide Traffic). Gemini and Character AI get a few hundred million but Claude doesn't even register in comparison. [0]
    Last they reported, OpenAI said they had 200M weekly active users.[1] Anthropic doesn't have anything approaching that.
    [0] https://www.similarweb.com/blog/insights/ai-news/chatgpt-top...
    [1] https://www.reuters.com/technology/artificial-intelligence/o...
    [-]
    - rafaelmn 253 days ago
      I basically have to tell most of my coworkers to stop using GPT and switch to Claude for coding - Sonnet 3.5 is the first model that I feel isn't wasting my time.
    - Eisenstein 253 days ago
      They also had a very limited roll-out at first. Until somewhat recently Canada and Europe were excluded from the list of places they allowed sign-ups from.
  - pradn 253 days ago
    I suppose business customers are savvy and will do enough research to find the best cost-performance LLM. Whereas consumers are more brand and habit oriented.
    I do find myself running into Claude limits with moderate use. It's been so helpful, saving me hours of debugging some errors w/ OSS products. Totally worth $20/mo.
- quirino 253 days ago
  Traveling to the US recently, I was surprised to see Claude ads around the city/in the airport. It seems like they're investing on marketing there.
  In my country I've never seen anyone mention them at all.
  [-]
  - gregbarbosa 253 days ago
    Been traveling more recently, and I've seen those ads in major cities like NYC or San Francisco, but not Miami.
devinprater 253 days ago
Maybe LLM's helping blind people like me play video games that aren't accessible to us normally, is getting closer!
[-]
- KoolKat23 253 days ago
  Google has the tech (some of it's gathering dust, but they have it). They can use the gameplay tech developed for stadia when a user experiences lag and combine it with their LLM.
- accrual 253 days ago
  Definitely! Those with movement disabilities could have a much easier time if they could just dictate actions to the computer and have them completed with some reliability.
lr1970 253 days ago
I am curious why "upgraded Claude 3.5 Sonnet" instead of simply Claude 3.6 Sonnet? Minor version increment is a standard way of versioning update. Am i missing something or it is just Anthropic marketing?
[-]
- loktarogar 253 days ago
  Probably because there was no 3.1-3.4, and that the .5 is mostly just to represent that it's an upgrade on Claude 3 but not quite enough to be a Claude 4
  [-]
  - sumedh 252 days ago
    Chrome is at version 130, who really cares about the version number.
ramesh31 253 days ago
Claude is absurdly better at coding tasks than OpenAI. Like it's not even close. Particularly when it comes to hallucinations. Prompt for prompt, I see Claude being rock solid and returning fully executable code, with all the correct imports, while OpenAI struggles to even complete the task and will make up nonexistent libraries/APIs out of whole cloth.
[-]
- egillie 253 days ago
  Does anyone know _why_ it’s so much better at coding? Better architecture, better training data, better RLHF?
- rubslopes 253 days ago
  I've been using a lot of o1-mini and having a good experience with it.
  Yesterday I decided to try sonnet 3.5. I asked for a simple but efficient script to perform fuzzy match in strings with Python. Strangely, it didn't even mention existing fast libraries, like FuzzyWuzzy and Rapidfuzz. It went on to create everything from scratch using standard libraries. I don't know, I thought this was something basic for it to stumble on.
  [-]
  - ssijak 253 days ago
    just ask it to use libraries you want; you cant expect it to magically read your mind, you need to guide every LLM to what are your must/nice haves
    [-]
    - rubslopes 249 days ago
      Hey, I just wanted to say that I kept using it and the new Sonnet 3.5 is so much better than any other competitor.
- codingwagie 253 days ago
  Yeah, sonnet is noticeably better. To the point that openai is almost unusable, too many small errors
itissid 253 days ago
This can power one of my favorite use-cases.
Like find me a list of things to do with a family, given today's weather and in the next 2 hours, quiet sit down with lots of comfy seating, good vegetarian food...
Not only is this kind of use getting around API restrictions, it is also a superior way to do search: Specify arbitrary preferences upfront instead of a search box and trawling different modalities of content to get better result. The possibilities for wellness use cases are endless, especially for end users that care about privacy and less screen use.
swyx 253 days ago
my quick notes on Computer Use:
- "computer use" is basically using Claude's vision + tool use capability in a loop. There's a reference impl but there's no "claude desktop" app that just comes with this OOTB
- they're basically advertising that they bumped up Claude 3.5's screen vision capability. we discussed the importance of this general computer agent approach with David on our pod https://x.com/swyx/status/1771255525818397122
- @minimaxir points out questions on cost. Note that the vision use is very sparing - the loop is I/O constrained - it waits for the tool to run and then takes a screenshot, then loops. for a simple 10 loop task at max resolution, Haiku costs <1 cent, Sonnet 8 cents, Opus 41 cents.
- beating o1-preview on SWEbench Verified without extended reasoning and at 4x cheaper output per token (a lot cheaper in total tokens since no reasoning tokens) is ABSOLUTE mogging
- New 3.5 Haiku is 68% cheaper than Claude Instant haha
references i had to dig a bit to find
- https://www.anthropic.com/pricing#anthropic-api
- https://docs.anthropic.com/en/docs/build-with-claude/vision#...
- loop code https://github.com/anthropics/anthropic-quickstarts/blob/mai...
- some other screenshots https://x.com/swyx/status/1848751964588585319
- https://x.com/alexalbert__/status/1848743106063306826
- model card https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Cla...
[-]
- akshayKMR 253 days ago
  Haven't used vision models before, can someone comment if they are good at "pointing things". E.g given a picture, give co-ordinate for text "foo".
  This is the key to accurate control, it needs to be very precise.
  Maybe Claude's model is trained at this. Also what about open source vision models? Any ones good at "pointing things" on a typical computer screen?
  [-]
  - swyx 253 days ago
    i mean like with everything they'll kinda be able to do it and only get really good at it if the model trainers prioritized it. see Pixmo for a recent example
- abrichr 253 days ago
  See https://github.com/OpenAdaptAI/OpenAdapt for an open source implementation that includes a desktop app OOTB.
bhouston 253 days ago
Is there an easy way to use Claude as a Co-Pilot in VS Code? If it is better at coding, it would be great to have it integrated.
[-]
- neb_b 253 days ago
  You can use it in Cursor - called "Cursor Tab"
  IMO Cursor Tab performs much better than Co-Pilot, easily works through things that would cause Co-Pilot to get stuck, you should give it a try
  [-]
  - TiredOfLife 253 days ago
    As I understand Cursor tab autocomplete uses their own model. Only chat has Sonnet and co.
    [-]
    - neb_b 253 days ago
      Ah, i thought it used the model selected for your prompts, either way, it seems to work very well
      [-]
      - teddarific 253 days ago
        I originally thought that too but learned yesterday they have their own model. Definitely explains how its so fast and accurate!
  - codingwagie 253 days ago
    its funny that cursor.sh with < 30 developers has a better autocomplete model than microsoft
- Lalabadie 253 days ago
  For Copilot-like use, Continue is the plugin you're looking for, though I would suggest using a cheaper/faster model to get inline completions.
  For Cursor-like use (giving prompts and letting it create and modify files across the project), Cline – previously Claude Dev – is pretty good.
- sunaookami 253 days ago
  Cody by Sourcegraph has unlimited code completions for Claude & a very generous monthly message limit. They don't have this new version I think but they roll these out very fast.
  [-]
  - sqs 253 days ago
    Cody (https://cody.dev) will have support for the new Claude 3.5 Sonnet on all tiers (including the free tier) asap. We will reply back here when it's up.
    [-]
    - sunaookami 253 days ago
      Thank you for Cody! Enjoy using it and the chat is perfect for brainstorming and iteratin. Selecting code + asking to edit it makes coding so much fun. Kinda feel like a caveman at work without it :)
  - jdorfman 253 days ago
    We are live!
- cptcobalt 253 days ago
  You can easily use a plugin like https://www.continue.dev/ and configure it to use Claude 3.5 Sonnet.
- TiredOfLife 253 days ago
  Codeium (cheapest), double.bot and continue.dev (with api key) have Claude in chat.
  https://github.com/cline/cline (with api key) has Claude as agent.
- sersi 253 days ago
  Tabnine includes Claude as an option. I've been using it to compare Claude Sonnet to Chatgpt-4o and Sonnet is clearly much better.
- machiaweliczny 253 days ago
  You can use Cursor (VS fork) with private Anthropic key
- BudaDude 253 days ago
  Cursor uses Claude as its base model.
  There may be extensions for VScode to do it but it will never be allowed in Copilot unless MS and OpenAI have a falling out.
- mkummer 253 days ago
  Continue.dev's VS Code extension is fantastic for this
zone411 253 days ago
It improves to 25.9 over the previous version of Claude 3.5 Sonnet (24.4) on NYT Connections: https://github.com/lechmazur/nyt-connections/.
[-]
- amarcheschi 253 days ago
  Perhaps it's just because English is not my native language, but the prompt 3 isn't quite clear at the beginning when it says "group of four. Words (...)". It is not explained what the group of four must be, if I add to the prompt "group of four words" Claude 3.5 manages to answer it, while without it, Claude tells it is not that clear and can't answer
- jjice 253 days ago
  What a neat bench mark! I'm blown away that o1 absolutely crushes everyone else in this. I guess the chain of thought really hashes out those associations.
  [-]
  - rkharsan64 253 days ago
    Isn't it possible that o1 was also trained on this data (or something super similar) directly? The score seems disproportionately high.
    [-]
    - usaar333 253 days ago
      They definitely considered it. Early theinformation articles talked about how high the performance of strawberry was on it.
vok 253 days ago
This "Computer use" demo:
https://www.youtube.com/watch?v=jqx18KgIzAE
shows Sonnet 3.5 using the Google web UI in an automated fashion. Do Google's terms really permit this? Will Google permit this when it is happening at scale?
[-]
- accrual 253 days ago
  I wonder how they could combat it if they choose to disallow AI access through human interfaces. Maybe more captchas, anti-AI design language, or even more tracking of the user's movements?
gumboshoes 253 days ago
For me, one of the more useful steps on macOS will be when local AI can manipulate anything that has an Apple Script library. The hooks are there and decently documented. For meta purposes, having AI work with a third-party app like Keyboard Maestro or Raycast will even further expand the pre-built possibilities without requiring the local AI to reinvent steps or tools at the time of each prompt.
cube2222 253 days ago
This looks quite fantastic!
Nice improvements in scores across the board, e.g.
> On coding, it [the new Sonnet 3.5] improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding.
I've been using Sonnet 3.5 for most of my AI-assisted coding and I'm already very happy (using it with the Zed editor, I love the "raw" UX of its AI assistant), so any improvements, especially seemingly large ones like this are very welcome!
I'm still extremely curious about how Sonnet 3.5 itself, and its new iteration are built and differ from the original Sonnet. I wonder if it's in any way based on their previous work[0] which they used to make golden-gate Claude.
[0]: https://transformer-circuits.pub/2024/scaling-monosemanticit...
[-]
- machiaweliczny 253 days ago
  I'm waiting for Aider benchmark
  [-]
  - voiper1 253 days ago
    It's out, and improved!
    It went from 77.4% to 84.2%, skipping past O1-preview which is at 79.7%
    Source: https://aider.chat/docs/leaderboards/
FloatArtifact 253 days ago
It will interesting to see how this evolves. UI automation use case is different from accessibility do to latency requirement. latency matters a lot for accessibility not so much for ui automation testing apparatus.
I've often wondered what the combination of grammar-based speech recognition and combination with LLM could do for accessibility. Low domain Natural Language Speech recognition augmented by grammar based speech recognition for high domain commands for efficiency/accuracy reducing voice strain/increasing recognition accuracy.
https://github.com/dictation-toolbox/dragonfly
cynicalpeace 253 days ago
This bolsters my opinion that OpenAI is falling rapidly behind. Presumably due to Sam's political machinations rather than hard-driving technical vision, at least that's what it seems like, outside looking in.
Computer use seems it might be good for e2e tests.

lossolo 253 days ago

Livebench updated

  Model                         | Global  | Reasoning | Coding  | Math    | Data    | Language | IF

  ------------------------------|---------|-----------|---------|---------|---------|----------|-------

  o1-preview-2024-09-12          | 66.02   | 68.00     | 50.85   | 62.92   | 63.97   | 72.66    | 77.72

  claude-3-5-sonnet-20241022     | 60.33   | 58.67     | 67.13   | 51.28   | 52.78   | 58.09    | 74.05

  claude-3-5-sonnet-20240620     | 59.80   | 58.67     | 60.85   | 53.32   | 56.74   | 56.94    | 72.30

urbandw311er 253 days ago
> we have provided three tools > bash shell
November 2024: AI is allowed to execute commands in a bash shell. What could possibly go wrong?
Hizonner 253 days ago
Can this solve CAPTCHAs for me? It's starting to get to the point where limited biological brains can't do them.
mercacona 253 days ago
I'm giving the new Sonnet a chance, although for my use as a writing companion so far, Opus has been king among all the models I've tried.
However, I've been using Opus as a writing companion for several months, especially when you have writer's block and ask it for alternative phrases, it was super creative. But in recent weeks I was noticing a degradation in quality. My impression is that the model was degrading. Could this be technically possible? Might it be some kind of programmed obsolescence to hype new models?
[-]
- KoolKat23 253 days ago
  You're expectations could just be increasing as you start taking it for granted and are using other models.
freetonik 253 days ago
Fascinating. Though I expect people to be concerned about privacy implications of sending screenshots of the desktop, similar to the backlash Microsoft has received about their AI products. Giving the remote service actual control of the mouse and keyboard is a whole another level!
But I am very excited about this in the context of accessibility. Screen readers and screen control software is hard to develop and hard to learn to use. This sort of “computer use” with AI could open up so many possibilities for users with disabilities.
[-]
- minimaxir 253 days ago
  The key difference is that Microsoft Recall wasn't opt-in.
- sharkjacobs 253 days ago
  There's such a gulf between choosing to send screenshots to Anthropic and Microsoft recording screenshots without user intent or consent.
- swalsh 253 days ago
  I suspect businesses will create VDI's or VM's for this express purpose. One because it scales better, and 2 because you can control what it has access to easier and isolate those functions.
- abrichr 253 days ago
  > I expect people to be concerned about privacy implications of sending screenshots of the desktop
  That's why in https://github.com/OpenAdaptAI/OpenAdapt we've built in several state-of-the-art PII/PHI scrubbers.
mmooss 253 days ago
Of course there's great inefficiency in having the Claude software control a computer with a human GUI mediating everything, but it's necessary for many uses right now given how much we do where only human interfaces are easily accessible. If something like it takes off, I expect interfaces for AI software would be published, standardized, etc. Your customers may not buy software that lacks it.
But what I really want to see is a CLI. Watching their software crank out Bash, vim, Emacs!, etc. - that would be fascinating!
[-]
- modeless 253 days ago
  I hope specialized interfaces for AI never happen. I want AI to use human interfaces, because I want to be empowered to use the same interfaces as AI in the future. A future where only AI can do things because it uses an incomprehensible special interface and the human interface is broken or non-existent is a dystopia.
  I also want humanoid robots instead of specialized non-humanoid robots for the same reason.
  [-]
  - accrual 253 days ago
    Maybe we'll end up with both, kind of like how we have scripting languages for ease of development, but we also can write assembly if we need bare metal access for speed.
  - torginus 253 days ago
    Imo, APIs and to a lesser extent cli tools are already specialized tools made for LLMs.
    I've been editing videos with ChatGPT4 + ffmpeg for a year now.
- accrual 253 days ago
  I agree, I bet models could excel at CLI tasks since the feedback would be immediate and in a language they can readily consume. It's probably much easier for them to to handle "command requires 2 arguments and only 1 was provided" than to do image-to-text on an error modal and apply context to figure out what went wrong.
turnsout 253 days ago
Wow, there's a whole industry devoted to what they're calling "Computer Use" (Robotic Process Automation, or RPA). I wonder how those folks are viewing this.
torginus 253 days ago
Claude's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges for Claude and we encourage developers to begin exploration with low-risk tasks.
Nice, but I wonder why didn't they use UI automation/accessibility libraries, that have access to the semantic structure of apps/web pages, as well as accessing documents directly instead of having Excel display them for you.
[-]
- abrichr 253 days ago
  We use operating system accessibility APIs when available in https://github.com/OpenAdaptAI/OpenAdapt.
- accrual 253 days ago
  I wonder if the model has difficulties for the same reason some people do - UI affordability has gone down with the flattening, hover-to-see scrollbar, hamburger-menu-ization of UIs.
  I'd like to see a model trained on a Windows 95/NT style UI - would it have an easier time with each UI element having clearly defined edges, clearly defined click and dragability, unified design language, etc.?
  [-]
  - torginus 253 days ago
    What the UI looks like has no effect on for example, Windows UI Automation libraries. How the tech works is that it queries the process directly for the sematic description of items, like here's a button called 'Delete', here's a list of items for TODO's, and you get the tree structure directly from the API.
    I wouldn't be surprised if they are working off of screenshots, they still trained their models on having said screenshots annotated by said automation libraries, which told the AI what pixel is what.
- cherioo 253 days ago
  I think this is to make human /user experience better. If you use accessibility features, then user need to know how to use those features. Similar to another comment in here, the UX they shoot for is “click the red button with cancel on it”, and ship that ASAP.
sedatk 253 days ago
> developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text.
So, this is how AI takes over the world.
ford 253 days ago
Seems like both:
- AI Labs will eat some of the wrappers on top of their APIs - even complex ones like this. There are whole startups that are trying to build computer use.
- AI is fitting _some_ scaling law - the best models are getting better and the "previously-state-of-the-art" models are fractions of what they cost a couple years ago. Though it remains to be seen if it's like Moore's Law or if incremental improvements get harder and harder to make.
[-]
- skybrian 253 days ago
  It seems a little silly to pretend there’s a scaling “law” without plotting any points or doing a projection. Without the mathiness, we could instead say that new models keep getting better and we don’t know how long that trend will continue.
  [-]
  - ford 247 days ago
    "Law" might not be the right word - but there's no denying it's scaling with compute/data/model size. I suppose law happens after continued evidence over years.
  - ctoth 253 days ago
    > It seems a little silly to pretend there’s a scaling “law” without plotting any points or doing a projection.
    Isn't this Kaplan 2020 or Hoffmann 2022?
    [-]
    - skybrian 253 days ago
      Yes, those are scaling laws, but when we see vendors improving their models without increasing model size or training longer, they don't apply. There are apparently other ways to improve performance and we don't know the laws for those.
      (Sometimes people track the learning curve for an industry in other ways, though.)
jatins 253 days ago
How does the computer use work -- Is this a desktop app they are providing that can do actions on your computer? Didn't see any such mention in the post
[-]
- minimaxir 253 days ago
  Quickstart is here: https://github.com/anthropics/anthropic-quickstarts/tree/mai...
- thundergolfer 253 days ago
  It’s a sandbox compute environment, using Gvisor or Firecracker or similar, which exposes a browser environment to the LLM.
  modal.com’s modal.Sandbox can be the compute layer for this. It uses Gvisor under the hood.
  [-]
  - dtquad 253 days ago
    Is there any Python/Node.js library to easily spawn secure isolated compute environments, possibly using gvisor or firecracker under the hood?
    This could be useful to build a self-hosted "Computer use" using Ollama and a multimodal model.
    [-]
    - eperot 252 days ago
      I have been [working on one](https://github.com/EtiennePerot/safe-code-execution)! The library is in [src/safecode/sandbox.py](https://github.com/EtiennePerot/safe-code-execution/blob/mas...).
- abrichr 253 days ago
  See https://github.com/OpenAdaptAI/OpenAdapt for an open source alternative that includes a desktop app.
- ZiiS 253 days ago
  It is a docker container providing a remote desktop you can see; they strongly recomend you also run it inside a VM.
Bjorkbat 253 days ago
Tried my standard go-to for testing, asked it to generate a voronoi diagram using p5js. For the sake of job security I'm relieved to see it still can't do a relatively simple task with ample representation in the Google search results. Granted, p5js is kind of niche, but not terribly so. It's arguably the most popular library for creating coding.
In case you're wondering, I tried o1-preview, and while it did work, I was also initially perplexed why the result looked pixelated. Turns out, that's because many of the p5js examples online use a relatively simple approach where they just see which cell-center each pixel is closest to, more or less. I mean, it works, but it's a pretty crude approach.
Now, granted, you're probably not doing creative coding at your job, so this may not matter that much, but to me it was an example of pretty poor generalization capabilities. Curiously, Claude has no problem whatsoever generating a voronoi diagram as an SVG, but writing a script to generate said diagrams using a particular library eluded it. It knows how to do one thing but generalizes poorly when attempting to do something similar.
Really hard to get a real sense of capabilities when you're faced with experiences like this, all the while somehow it's able to solve 46% of real-world python pull-requests from a certain dataset. In case you're wondering, one paper (https://cs.paperswithcode.com/paper/swe-bench-enhanced-codin...) found that 94% of the pull-requests on SWE-bench were created before the knowledge cutoff dates of the latest LLMs, so there's almost certainly a degree of data-leakage.
[-]
- nemothekid 253 days ago
  It's surprising how much knowledge is not easily googleable and can only unearched by deep diving into OSS or asking an expert. I recently was debugging a rather naive gstreamer issue where I was seeing a delay in the processing. ChatGPT, Claude and Google were all unhelpful. I spend the next couple days reading the source code, found my answer, and thought it was a bug.
  Asked the mailing list, and my problem was solved in 10 seconds by someone who could identify the exact parameter that was missing (and IMO, required some architecture knowledge on how gstreamer worked - and why the unrelatedly named parameter would fix it). The most difficult problems fall into this camp - I don't usually find myself reaching for LLMs when the problem is trivial unless it involves a mountain of boilerplate.
- cageface 253 days ago
  I recently asked it to give me code to do gapless playback of audio files using Apple's AVAudioEngine APIs. It got it wrong and additional prompts to explain why it was wrong didn't help.
  To me what it seems like these tools do really well is paraphrase stuff that's in their training data.
- cbhl 252 days ago
  "generate a voronoi diagram using p5js. use a vector algorithm to generate the diagram instead of a raster" seems to work for me?
Centigonal 253 days ago
They should just adopt Apple "version numbers:" Claude Sonnet (Late 2024).
mtgentry 253 days ago
What are the licensing implications of this? If I’m Google, I’d be pissed that my software is being used without a human there looking at the ads.
[-]
- SturgeonsLaw 253 days ago
  They just need to start tailoring their ads to things that bots might be interested in
- KoolKat23 253 days ago
  And today they added a new AI abuse clause to their t&C's lol.
flockonus 253 days ago
Are these ppl are aware that they can bump minor versions?
The mkt team vetoed Claude 3.6 ???
[-]
- sumedh 252 days ago
  Get ready for Claude 3.5 Pro Max.
bbor 253 days ago
Ok I know that we're in the post-nerd phase of computers, but version numbers are there for a reason. 3.6, please? 3.5.1??
runako 253 days ago
I really don't get their model. They have very advanced models, but the service overall seems to be a jumble of priorities. Some examples:
Anthropic doesn't offer an unlimited chatbot service, only plans that give you "more" usage, whatever that means. If you have an API key, you are "unlimited," so they have the capability. Why doesn't the chatbot allow one to use their API key in the Claude app to get unlimited usage? (Yes, I know there are third-party BYOK tools. That's not the question.)
Claude appears to be smart enough to make an Excel spreadsheet with simple formulae. However, it is apparently prevented from making any kind of file. Why? What principle underlies that guardrail that does not also apply to Computer Use?
Really want to make Claude my daily driver, but right now it often feels too much like a research project.
[-]
- stuckkeys 253 days ago
  Even with API, depending what tier you are sitting on, there is daily limits. OpenAI used to be able to generate files for you, they changed that. It was useful.
  [-]
  - runako 253 days ago
    Interestingly enough, after Claude refused to generate a file for me, I sent the same request to ChatGPT and got the Excel file I wanted.
    I wasn't aware of tiers in the Claude API, they are not mentioned on the API pricing page. Are the limits disclosed or just based on vibes like they are for the chatbot?
- saaaaaam 253 days ago
  What do you mean by “file” here? I’m making files on a daily basis, including CSVs, html, executable code, XML, JSON and other formats. It built me an entire visual wireframe for something the other day.
  Are you using artefacts?
  But I’m maybe misunderstanding your point because my use is relatively basic through the built in chatbot.
  [-]
  - runako 253 days ago
    I asked it to generate a very basic Excel file. It generated text as Markdown. I reiterated that I want an Excel file with formulae and it provided this as part of its response:
    ----
    No, I am not able to generate or create an actual Excel file. As an AI language model, I don't have the capability to create, upload, or send files of any kind, including Excel spreadsheets.
    ----
    [-]
    - mathnmusic 253 days ago
      It does not have access to the Excel app. You may be able to generate the .xlsx file using python libraries but you would need to run the python code on your own. ChatGPT can run generated code, which is probably why it works there.
      [-]
      - runako 253 days ago
        I don't expect it to have the Excel app, I expect it to run the code it is capable of generating.
        This is what I mean by their strategy being a jumble. Claude can do the hard part of figuring out what code to write and writing it, but then refuses to do the easier part of executing the code.
        [-]
        simonw 253 days ago
        The Claude web UI cannot generate binary files, it's (currently) restricted to plain text.
        If you want binary files you'd be better off with ChatGPT with Code Interpreter mode, which can run Python code that generates binary content.
        Or ask Claude to write you Python code that generates Excel files and then copy and paste that onto your own computer and run it yourself.
        [-]
        runako 253 days ago
        > you'd be better off with ChatGPT with Code Interpreter mode
        Yes, this is what I am saying. Why go to the trouble to build something as capable as Claude and then hamstring it from being as useful as ChatGPT? I have no doubt that Claude could be more useful if the Anthropic team would let it shine.
        [-]
        simonw 253 days ago
        They've been investing engineering effort in Claude Artifacts instead, which I find incredibly useful: https://simonwillison.net/2024/Oct/21/claude-artifacts/
        I'd love to see them produce their own Code Interpreter alternative, but in the meantime it's open for third parties to offer that (and a few do).
        [-]
        runako 253 days ago
        I have used Artifacts a couple of times and found them useful.
        But now I am even more confused. They make an LLM that can generate code. They make a sandbox to run generated code. They will even host public(!) apps that run generated code.
        But what they will not do is run code in the chatbot? Unless the chatbot context decides the code is worthy of going into an Artifact? This is kind of what I mean by the offering being jumbled.
        BTW saw your writeup on the LLM pricing calculator -- very cool!
        [-]
        simonw 253 days ago
        Yeah I can't imagine Claude will be without a server-side code execution platform forever. Both OpenAI (Code Interpreter) and Gemini (https://ai.google.dev/gemini-api/docs/code-execution) have had that for a while now, and it's spectacularly useful. It fills a major gap in a Chatbot's skills too, since it lets them reliably run calculations.
        Sandboxing is a hard problem, but it's not like Anthropic are short on money or engineering talent these days.
        saaaaaam 251 days ago
        See, I think this is a case of personal preference. I much prefer Claude's approach of figuring out the code and writing it for me to execute myself, rather than having it all in one box. Apart from anything else, it makes me think a little more about the process, and the desired outcome, rather than just iterate, click, iterate, click.
        It's marginally less efficient, for sure, but it allows me greater visibility on the process, and gives me more confidence that whatever it's doing is what I want it to do.
        But maybe that's some weird luddite-ism on my part, and I should just embrace an even blacker box where everything is done in the tool.
        [-]
        runako 251 days ago
        YMMV obviously. If I ask the magic box to make a spreadsheet, I don't need to see the Python for that any more than I need to see the code it uses to summarize something I paste in. I don't really even care that it has to write code to make the spreadsheet at all.
    - tucnak 253 days ago
      This is claude.ai or console.anthropic.com?
      [-]
      - runako 253 days ago
        I believe Claude.ai, but I run as a Safari Web app because Anthropic doesn't ship a native app for MacOS (which I would prefer).
    - KoolKat23 253 days ago
      I think xlsx files are a proprietary Microsoft format.
- sumedh 252 days ago
  > Anthropic doesn't offer an unlimited chatbot service,
  because its expensive, if they give unlimited service someone will misuse it
  [-]
  - runako 251 days ago
    You misunderstood. Anthropic does not offer a paid Claude offering akin to ChatGPT Plus, where usage is effectively unlimited.
    [-]
    - sumedh 250 days ago
      OpenAI has more funding and more customers than Anthropic.
hubraumhugo 253 days ago
I've seen quite a few YC startups working on AI-powered RPA, and now it looks like a foundational model player is directly competing in their space. It will be interesting to see whether Anthropic will double down on this or leave it to third-party developers to build commercial applications around it.
[-]
- suchintan 253 days ago
  We're one of those players (https://github.com/Skyvern-AI/skyvern) and we're definitely watching the space with a lot of excitement
  We thought it was inevitable that OpenAI / Anthropic would veer into this space and start to become competitive with us. We actually expected OpenAI to do it first!
  What this confirms is that there is significant interest in computer / browser automation, and the problem is still unsolved. We will see whether the automation itself is an application later problem (our approach) or whether the model needs to be intertwined with the application (Anthropic's approach here)
joshuamcginnis 253 days ago
Is there anything out there yet that will let me issue the command:
> Refactor the api folder with any recommended readability improvements or improvements that would help DRY up code without adding additional complexity.
Then I can just `git status` to see the changes?
[-]
- falcor84 253 days ago
  Aider is great at this stuff. The recommended way is to have it automatically commit, and then you can examine and possibly revert/reset its commits (or just have it work on a separate branch), but you can also use --no-auto-commits
- vipshek 253 days ago
  Install Cursor (https://cursor.com), go into Cursor Settings and disable everything but Claude, then open Composer (Ctrl/Cmd + I). Paste in your exact command above. I bet it’ll do something pretty close to what you’re looking for.
attentive 253 days ago
They need to work on their versioning.
"3.5 Sonnet (New)", WTAF? - just call it 3.6 Sonnet or something.
Is it "New" sonnet? is it "upgraded"? Is there a difference? How do I know which one I use?
I can understand claude-3-5-sonnet-20241022, but that's not what users see.
abc-1 253 days ago
I tried to get it to translate a document and it stopped after a few paragraphs and asked if I wanted it to keep going. This is not appropriate for my use case and it kept doing this even though I explicitly told it not to. The old version did not do this.
[-]
- graeme 253 days ago
  I noticed some timeouts today. Could be capacity limits from the announcement
lutusp 253 days ago
> "... and similar speed to the previous generation of Haiku."
To me this is the most annoying grammatical error. I can't wait for AI to take over all prose writing so this egregious construction finally vanishes from public fora. There may be some downsides -- okay, many -- but at least I won't have to read endless repetitions of "similar speed to ..." when the correct form is obviously "speed similar to".
In fact, in time this correct grammar may betray the presence of AI, since lowly biologicals (meaning us) appear not to either understand or fix this annoying error without computer help.
submeta 253 days ago
That’s too much control for my taste. I don’t want anthropic to see my screen. I rather prefer a VS Code with integrated Claude. A version that can see all my dev files in a given folder. I don’t need it to run Chrome for me.
[-]
- accrual 253 days ago
  It just depends on the task I suppose. One could have a VM dedicated to a model and let it control it freely to accomplish some set of tasks, then wipe/redeploy if it ever breaks.
  [-]
  - submeta 253 days ago
    Well, that’s another way of saying „not allowing it to see my screen“ ;)
bluelightning2k 253 days ago
This is what the Rabbit "large action model" pretended to be. Wouldn't be surprised to see them switch to this and claim they were never lying about their capabilities because it works now.
Pretty cool for sure.
[-]
- swalsh 253 days ago
  I think Rabbit had the business model wrong though, I don't think automating UI's to order pizza is anywhere near as valuable as automating the app workflows for B2B users.
RecycledEle 253 days ago
How long until it is profitable the tell a cheap AI to "win this game by collecting resources and advancing in-game" and then sell the account on eBay?
I wonder what optimizations could be made? Could a gold farmer have the directions from one AI control many accounts? Could the AI program simpler bots for each bit of the game?
I can imagine not being smart enough to play against computers, because I am flagged as a bot. I can imagine a message telling me I am banned because "nobody but a stupid bot would score so low."
amai 252 days ago
Finally a general tool to solve captchas for my web scrapers.
wesleyyue 253 days ago
If anyone would like to try the new Sonnet in VSCode. I just updated https://double.bot to the new Sonnet. (disclaimer: I am the cofounder/creator)
---
Some thoughts:
* Will be interesting to see what we can build in terms of automatic development loops with the new computer use capabilities.
* I wonder if they are not releasing Opus because it's not done or because they don't have enough inference compute to go around, and Sonnet is close enough to state of the art?
gerash 253 days ago
The "computer use" demos are interesting.
It's a problem we used to work on and perhaps many other people have always wanted to accomplish since 10 years ago. So it's yet to be seen how well it works outside a demo.
What was surprising was the slow/human speed of operations. It types into the text boxes at a human speed rather than just dumping the text there. Is it so the human can better monitor what's happening or is it so it does not trigger Captchas ?
throwaway0123_5 253 days ago
This is incredibly cool but it seems like the potential damage from a "hallucination" in this mode is considerable, especially when they provide examples of it going very far off-track (looking up Yellowstone pictures). Would basically need constant monitoring for me not to be paranoid it did something stupid.
Also seems like a privacy issue with them sending screenshots of your device back to their servers.
maestrae 253 days ago
anybody know how the hell they're combating / gonna combat captcha's, cloudflare blocking, etc. I remember playing in this space on a toy project and being utterly frustrated by anti-scraping. Maybe one good thing that will come out of this AI boom is that companies will become nicer to scrapers? Or maybe, they'll just cut sweetheart deals?
29decibel 253 days ago
I am surprised it uses macOS as the demo, as I thought it would be harder to control vs Ubuntu. But maybe at the same time, macOS is the most predictable/reliable desktop environment? I noticed that they use virtual environment for the demo, curious how do they build that along with docker, is that leveraging the latest virtualization framework from Apple?
Tepix 253 days ago
Interesting stuff, i look forward to future developments.
A comment about the video: Sam Runger talks wayyy too fast, in particular at the beginning.
msoad 253 days ago
I skimmed through the computer use code. It's possible to build this with other AI providers too. For instance you can asks ChatGPT API to call functions for click and scroll and type with specific parameters and execute them using OS's APIs (A11y APIs usually)
Did I miss something? Did they have to make changes to the model for this?
[-]
- accrual 253 days ago
  > execute them using OS's APIs (A11y APIs usually)
  I wonder if we'll end up with a new set of AI APIs in Windows, macOS, and Linux in the future. Maybe an easier way for them to iterate through windows and the UI elements available in each.
  [-]
  - jlpom 252 days ago
    It already exists for KDE: https://community.kde.org/Selenium
fernly 252 days ago
Imagine the possibilities for cyber-crime. Surely you could program it to log in to a financial institution and transfer money. And if you had a list of user names and passwords from some large info breach? You could automate a LOT of transfers in a short amount of time...
tammer 253 days ago
This demo is impressive although my initial reaction is a sort of grief that I wasn't born in the timeline where Alan Kay's vision of object-oriented computing was fully realized -- then we wouldn't have to manually reconcile wildly heterogeneous data formats and interfaces in the first place!
[-]
- aryehof 253 days ago
  Vision of a “universal communicator” rather than OO?
aprilthird2021 252 days ago
OpenAI must be scared at this point. Anthropic is clobbering them at the high end of the market and Meta is providing free AIs at the low end. OpenAI is pretty soon going to be in the valueless middle fighting with tons of other companies for relevance
alok-g 252 days ago
Next stop after 'Computer Use' -- Multimodal input from a robot's sensors and generating various signals to control its actions.
Looking forward to see this in the coming few years. And hoping such a robot could be of help to many people including those old.
myprotegeai 253 days ago
How long until "computer use" is tricked into entering PII or PHI into an attackers website?
[-]
- accrual 253 days ago
  I imagine initial computer use models will be kind of like untrained or unskilled computer users today (for example, some kids and grandparents). They'll do their best but will inevitably be easy to trick into clicking unscrupulous links and UI elements.
  Will an AI model be able to correctly choose between a giant green "DOWNLOAD NOW!" advertisement/virus button and a smaller link to the actual desired file?
  [-]
  - myprotegeai 253 days ago
    Exactly. Personalized ads are now prompt injection vectors.
wewtyflakes 253 days ago
I wonder if OpenAI will fast follow; usually they're the ones to throw down the gauntlet. That being said, you can play around with OpenAI with a similar architecture of vision + agent + exec + loop using Donobu, though it is constrained to web browsers.
lairv 253 days ago
Offtopic but youtube doesn't allow me to view the embedded video, with a "Sign in to confirm you’re not a bot" message. I need to open a dedicated youtube tab to watch it
The barrier to scraping youtube has increased a lot recently, I can barely use yt-dlp anymore
[-]
- ALittleLight 253 days ago
  That's funny. I was recently scraping tens of thousands of YouTube videos with yt-dlp. I would encounter throttling of some kind where yt-dlp stopped working, but I'd just spin a new VPS up and the throttled VPS down when that happened. The throttling effort cost me ~1 hour of writing the logic to handle it.
  I say that's funny because my guess would be they want to block larger scale scraping efforts like mine, but completely failed, while they attempt at throttling puts captchas in front of legitimate users.
abraxas 253 days ago
Hopefully the coding improvements are meaningful because I find that as a coding assistant o1-preview beats it (at least the Claude 3.5 that was available yesterday) but I like Claude's demeanor more (I know this sounds crazy but it matters a bit to me)
nwnwhwje 253 days ago
Any comments on alignment with Anthropic's missions. Last time I checked Anthropic is about building SOTA as that is the only way to do safety research. Making money ans useful stuff commercially is a means to an end.
smcleod 253 days ago
I wonder when it'll actually be available in the Bedrock AU region, because as of right now we're still stuck using mid-range models from a year ago.
Amazon has really neglected ap-southeast-2 when it comes to LLMs.
[-]
- dheerkt 253 days ago
  Can you not use cross-region inference?
  [-]
  - smcleod 253 days ago
    90% of our customers do not allow this due to data sovereignty.
    Bedrock here is lagging so far behind several customers assume AWS simply aren't investing here anymore - or if they are it's an afterthought - and a very expensive one at that.
    I've spoken with several account managers and SAs and they seem similarly frustrated with the continual response from above that useful models are "coming soon".
    You can't even BYO models here, we usually end up spinning up big ol' GPU EC2 instances and serving our own, or for some tasks running locally as you can get better openweight LLMs.
    [-]
    - dheerkt 246 days ago
      Hmm interesting, didn't realize that data sovereignty requirements were so stringent. Wonder how other cloud providers are doing in this sense considering GPU shortages across the board.
brid 253 days ago
Looks like visual understanding of diagrams is improved significantly! For example, it was on par with Chat GPT 4o and Gemini 1.5 in parsing an ERD for a conceptual model, but now far excels over the others.
bergutman 253 days ago
They need to get the price of 3.5 Haiku down. It's about 2x 4o-mini.
[-]
- quotemstr 253 days ago
  Still super cheap
  [-]
  - caeril 253 days ago
    Precisely this.
    Aider (with the older Claude models) is already a semi-competent junior developer, and it will produce 1kloc of decent code for the equivalent of 50 cents in API costs.
    Sure, you still have to review the commits, but you have to do that anyway with human junior developers.
    Anthropic could charge 20x more and we would still be happy to pay it.
m3kw9 253 days ago
I suspect they are gonna need some local offload capabilities for Computer Use, the repeated screen reading can definitely be done locally on modern machines, otherwise the cost maybe impractical.
[-]
- abrichr 253 days ago
  See https://github.com/OpenAdaptAI/OpenAdapt for an open source alternative that runs segmentation locally.
- accrual 253 days ago
  Maybe we need some agent running on the PC to offload some of these tasks. It could scrape the display at 30 or 60 Hz and produce a textual version of what's going on for the model to consume.
thecolorgreen 253 days ago
This looks really similar to rabbit's Large Action Model (LAM). Cool!
https://www.rabbit.tech/rabbit-os

TacticalCoder 253 days ago

One suggestion, use the following prompt at a LLM:

    The combination of the words "computer use" is highly confusing. It's also "Yoda speak". For example it's hard for humans to parse the sentences *"Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku"*, *"Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku "* (it literally relies on the comma to make any sense) and *"Computer use for automated interaction"* (in the youtube vid's title: this one is just broken english). Please suggest terms that are not confusing for a new ability allowing an AI to control a computer as if it was a human.

amai 252 days ago
This "computer use" feature is obviously perfect for automating GUI tests. Will it work on screenshots of mobile devices like smartphones/tables, also?
[-]
- gonzan 252 days ago
  I think it’s too expensive for that at this point
Maynor 249 days ago
Join PeachLive and input my invitation code 6B94HL to get 20 free coins! Enjoy live video chat at {invitationUrl}
kingkongjaffa 253 days ago
Interestingly new claude only knows content up to:
> I'm limited to what I know as of April 2024, which includes the initial Claude 3 family launch but not subsequent updates.
mclau156 253 days ago
Did they just invent a new world of warcraft or runescape bot?
punnerud 253 days ago
Cursor AI already have the option to switch to using claude-3-5-sonnet-20241022 in the chat box.
I was about to try to add a custom API. I’m impressed by the speed of that team.
[-]
- neevans 253 days ago
  It's literally just adding one extra entry to a configuration file.
  [-]
  - punnerud 253 days ago
    I know, but similar updates to Copilot would probably take over a year and they designed it in a way that we got the update now without having to reinstall it.
Alifatisk 253 days ago
> Claude 3.5 Haiku matches the performance of Claude 3 Opus
Oh wow!
lostmsu 252 days ago
Can anyone share a .http or curl or anything similar based session with computer tool use? Docker containers make me cry.
taytus 253 days ago
Computer use won't allow you to log in to social media accounts, even if it is your account and credentials. Bummer.
robertkoss 253 days ago
Does anyone know how I could check whether my Claude Sonnet version that I am using in the UI has been updated already?
[-]
- nilsherzig 252 days ago
  The ui shows a (new) next to the model name for me (free user, Germany)
- lambdaba 253 days ago
  search for "20241022" in network tab in devtools, confirmed for me
crazystar 253 days ago
Looks like it just takes a screenshot and can't scroll so it might miss things.
Claude 3.5 Haiku will be released later this month.
[-]
- freetonik 253 days ago
  It can actually scroll.
  [-]
  - crazystar 253 days ago
    While we expect this capability to improve rapidly in the coming months, Claude's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges for Claude and we encourage developers to begin exploration with low-risk tasks.
    [-]
    - artur_makly 253 days ago
      Can someone please try this on a MAC/OS and just 100% verify if this puppy can scroll or not? thnks
      [-]
      - nilsherzig 252 days ago
        It does in the video. Just not the spreadsheet at the start.
myprotegeai 253 days ago
We are approaching FSD for the computer, with all of the lofty promises, and all of the horrible accidents.
iamsanteri 252 days ago
I love how they don't seem to be calling it "AgenticAI" or something like that.
throwvc3 253 days ago
What I'd like to know is whether prompt caching is available to Claude on AWS Bedrock now.
vivekkairi 253 days ago
aider benchmarks for claude 3.5 new are impressive. From 77.4% to 83.5% beating o1-preview.
netcraft 253 days ago
since they didnt rev the version, does this mean if we were using 3.5 today its just automatically using the new version? That doesnt seem great from a change management perspective
though I am looking forward to using the new one in cursor.ai
[-]
- minimaxir 253 days ago
  No, Claude's models use date-pinning. The new model endpoint is claude-3-5-sonnet-20241022
  https://docs.anthropic.com/en/docs/about-claude/models
2-3-7-43-1807 253 days ago
wow, i almost got worried but the cute music and the funny little monster on the desk convinced me that this all just fun and dandy and all will be good. the future is coming and we'll all be much more happy :)
bilsbie 253 days ago
Does this make cursor obsolete?
You can just use any IDE you want and it will work with it.
[-]
- jusgu 253 days ago
  Assuming running this new computer interactivity feature is as fast as cursor composer (which I don’t think it is)—it still doesn’t support codebase indexing, inline edits or references to other variables and files in the codebase. I can see how someone could use this to make some sort of cursor competitor but out of the box there’s a very low likelihood it makes cursor obsolete.
  [-]
  - 93po 253 days ago
    i really want cursor to integrate this so it can look at the results of a code change in the browser and then make edits as needed until it's accomplished what i asked of it. same for errors in the console etc. right now i have to manually describe the issue or copy and paste the error message and it'd be nice for it to just iterate more on its own
veggieWHITES 253 days ago
While I was initially impressed with it's context window, I got so sick of fighting with Claude about what it was allowed to answer I quit my subscription after 3 months.
Their whole policing AI models stance is commendable but ultimately renders their tools useless.
It actually started arguing with me about whether it was allowed to help implement a github repository's code as it might be copywritten... it was MIT licensed open source from Google :/
[-]
- r2_pilot 253 days ago
  I just include text that I own the device in question and that I have a legal team watching my every move. It's stupid, I agree, but not insurmountable. I had less refusals with Claude 3 Opus.
brcmthrowaway 253 days ago
This is bad news for SWEs!
esseti 253 days ago
I checked the docs but did not find it out. Cloude has API as the GPT Assistant? with also the ability to give a set of documents to work with?
It seems that you can only send single message, thus not relying on the ability to "learn" from predefined documents.
tylerchilds 253 days ago
computer use is really going to highlight how fragmented the desktop ecosystem is, but also this definitely paints more context on how microsoft wants to use their screenshot ai
iknownthing 253 days ago
Can Claude create and run a CI/CD pipeline now from a prompt?
jonesn11 252 days ago
How does one get access to it without using the API??
ta93754829 253 days ago
eventually, we'll be able to eliminate the intermediate "computer", and just let the ai render everything we need to interact with
efields 253 days ago
Captchas are toast.
[-]
- edm0nd 253 days ago
  they have been toast for at least a decade if not two now. With OCR and captcha solving services like DeathByCaptcha or AntiCaptcha where it costs ~$2.99 per 1k successfully solved captchas, they are a non-issue amd takes about 5-10 lines of code added to your script to implement a solution.
netcraft 253 days ago
im unclear, is haiku supposed to be similar to 4o-mini in usecase/cost/performance? If not, do they have an analog?
[-]
- machiaweliczny 253 days ago
  Probably better than 4o-mini, 4o-mini isn’t great from my testing. loses focus after 100 lines of text
  [-]
  - usaar333 253 days ago
    It's roughly tied in benchmarks
ta8645 252 days ago
Still can't use their services. They still require a phone number for some reason. What about those of us who don't have one?
[-]
- exdsq 252 days ago
  Don’t have a phone but want to write code interfacing with a paid LLM? How often does that happen?
- phito 252 days ago
  > What about those of us who don't have one?
  Why should they care about losing those ~10 potential users?
- szundi 252 days ago
  Buy a sim card, fixed.
  [-]
  - ta8645 252 days ago
    I'd have to also buy a phone, and i'm not doing that for a single service. It's a ridiculous requirement, that people are far too willing to accept.
mathiasrw 253 days ago
Just to confirm: did they just release a model with the exact same name as the previous one?
jerrygoyal 253 days ago
does anyone know what are some use cases for "computer use"?
nbzso 252 days ago
Just a question: For this thingy to work, I must give the provider access to my computer? Good luck. :)
Just another reason to use ONLY local LLM's.
[-]
- jviotti 251 days ago
  This. There is no way I would trust any AI provider to pretty much have full control over my computer.
geniium 252 days ago
This is amazing
g9yuayon 253 days ago
Is it just me who feels that Anthropic has been innovating faster than ChatGPT in the past year?
Maynor 249 days ago
6B94HL
postalcoder 253 days ago
and i was just planning to go to sleep…
[-]
- accrual 253 days ago
  I discovered Mindcraft recently and stayed up a few hours too late trying to convince my local model to play Minecraft. Seems like every time a new capability becomes available, I can't wait to experiment with it for hours, even at the cost of sleep.
dtquad 253 days ago
Now I am really curious how to programmatically create a sandboxed compute environment to do a self-hosted "Computer use" and see how well other models, including self-hosted Ollama models, can do this.
mannycalavera42 252 days ago
new VBA version just landed
anotherpaulg 253 days ago
The new Sonnet tops aider's code editing leaderboard at 84.2%. Using aider's "architect" mode it sets the SOTA at 85.7% (with DeepSeek as the "editor" model).
```
  84% Claude 3.5 Sonnet 10/22
  80% o1-preview
  77% Claude 3.5 Sonnet 06/20
  72% DeepSeek V2.5
  72% GPT-4o 08/06
  71% o1-mini
  68% Claude 3 Opus
```
It also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%!
```
  92% Sonnet 10/22
  75% o1-preview
  72% Opus
  64% Sonnet 06/20
  49% GPT-4o 08/06
  45% o1-mini
```
https://aider.chat/docs/leaderboards/
[-]
- gloosx 253 days ago
  I will repeat my question from one of the previous threads:
  Can someone explain these Aider benchmarks to me? They pass same 113 tests through llm every time. Why they then extrapolate ability of llm to pass these 113 basic python challenges to the general ability to produce/edit code? Couldn't LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?
  Did anyone ever try to change them test cases or wiggle conditions a bit to see if it will still hit the same %?
  [-]
  - carschno 252 days ago
    Indeed, test data like this constantly leaks into the training data, so these leaderboards are not necessarily representative for real-world problems. A better approach is to use variable evaluation like GSM-Symbolic (for evaluating mathematic reasoning): https://arxiv.org/abs/2410.05229
  - jo909 252 days ago
    > Couldn't LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?
    They could. They would easily be found out as they loose in real world usage or improved new unique benchmarks.
    If you were in charge of a large and well funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they specifically get excluded from training data?
    I would exclude them as well as possible so I get feedback on how "real" any model improvement is. I need to develop real world improvements in the end, and any short term gain in usage by cheating in benchmarks seems very foolish.
    [-]
    - gloosx 252 days ago
      It sounds very nice, but at the same time very naive, sorry. Funding is not a gift, and they must make money. The more funding they get - the more pressure there is to make money.
      When you're in charge of a billion-dollar valuation company which is expected to remain unprofitable by 2029, it's hard to find a topic more crucial and intriguing than growth and making more money.
      And yes, it is a recurring theme for vendors to tune their products specifically for industry-standard benchmarks. I can't find any specific reason for them not to pay people for training their model to score 90% on these 113 python tasks, as it directly drives profits up, whereas not doing it will bring absolute nothing to the table - surely they have their own internal benchmarks which they can exclude from training data.
    - youoy 252 days ago
      > If you were in charge of a large and well funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they specifically get excluded from training data?
      You should already know by now that economic incentives are not always aligned with science/knowledge...
      This is the true alignment problem, not the AI alignment one hahaha
      [-]
      - concordDance 252 days ago
        The AI alignement problem and the people alignment problem are actually the same problem! :D
        One is just a bit harder due to the less familiar mind "design".
    - carschno 252 days ago
      They cannot be found out as long as there is no better evaluation. Sure, if they produce obvious nonsense, but the point of a systematic evaluation is exactly to overcome subjective impressions based on individual examples as a notion of quality.
      Also, you are right that excluding test data from the training data improves your model. However, given the insane amounts of training data, this requires significant effort. If that additionally leads to your model performing worse in existing leaderboards, I doubt that (commercial) organizations would pay for such an effort.
      And again, as long as there is no better evaluation method, you still won't know how much it really helps.
    - KeplerBoy 252 days ago
      This market is all about hype and mindshare, proper testing is hard and not performed by individuals, so there are no incentives not to train a bit on the test set.
    - gershy 252 days ago
      And if there is a board that will fire you if expected profits do not increase, do you still maintain this stance?
  - bilekas 253 days ago
    > Couldn't LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?
    Yes, this is an inherit problem with the whole idea of LLM's. They're pattern recognition "students" but the important thing, that all the providers like to sell is their reasoning. A good test is a reasoning test. I'll try to find a link and update with a reference.
  - Lucasoato 252 days ago
    There is an opportunity to develop black-box benchmarks and offer them to LLM providers to support their testing phase. If I were in their place, I would find it incredibly valuable to have such tamper-proof testing before releasing a model.
  - gloosx 248 days ago
    Conveniently, author of these benchmarks remains silent on topic every time. Think about it :)
- artemisart 253 days ago
  Thanks! I was waiting for your benchmarks. Do you plan to test haiku 3.5 too? It would be nice to show API prices needed to run the whole benchmark too to have a better idea of how many internal tokens o1 models consume.
- ianeigorndua 253 days ago
  Are these synthetic or real-world benchmarks?
  Answering myself: ”Aider’s code editing benchmark asks the LLM to edit python source files to complete 133 small coding exercises from Exercism”
  Not gonna start looking for a job any time soon
  [-]
  - zeroonetwothree 253 days ago
    Example I chose at random:
    > Convert a hexadecimal number, represented as a string (e.g. "10af8c"), to its decimal equivalent using first principles (i.e. no, you may not use built-in or external libraries to accomplish the conversion).
    So it's fairly synthetic. It's also the sort of thing LLMs should be great at since I'm sure there's tons of data on this sort of thing online.
    [-]
    - Vampiero 253 days ago
      Yeah but programming isn't about solving problems that were solved millions of times already. I mean, web dev kind of is, but that's not the point. If a problem is solved, then it's just a matter of implementing the solution and anyone can do that given the proper instructions (even without understanding how or why they solve the problem).
      I've formalized a lot of stuff I didn't understand just by copying the formulas from Wikipedia.
      As long as LLMs are not capable of proper reasoning, they will remain a gimmick in the context of programming.
      They should really just focus on refactoring benchmarks across many languages. If an AI can refactor my complex code properly without changing the semantics, it's good enough for me. But that unfortunately requires such a high-level understanding of the codebase that with the current tech it's just impossible to get a half-decent result in any real-world scenario.
  - stavros 253 days ago
    I use Claude for coding and it's fantastic. I definitely have outsourced a lot of my coding to it.
    [-]
    - seriocomic 253 days ago
      What's the (current) best way to integrate it? VS Code extension? Other IDE?
      [-]
      - codebolt 253 days ago
        I'll throw this out here as well: Is there any decent alternative to GitHub Copilot when using Visual Studio? (Pretty happy with it to be fair, but would be open to trying others.)
        [-]
        vramana 253 days ago
        Supermaven is really good. I am a paying user of super maven.
      - RobinL 253 days ago
        I use cursor (cursor.com) and it's fantastic
        [-]
        gershy 252 days ago
        Fellow cursor user here, I'm very new to it. I am getting some very convenient and welcome autocomplete. I am also getting quite a lot of bad autocomplete suggestions, which require cognitive overhead and context switching to evaluate. So I am thus far not fully convinced. Any tips for getting the most out of cursor?
        IanCal 252 days ago
        Huge seconding of cursor.
      - maeil 253 days ago
        Aider, created by the originator of this very comment thread.
      - Squarex 253 days ago
        Sourcegraph Cody.
    - ianeigorndua 253 days ago
      Sure it can do coding but can it do software engineering
      [-]
      - p1esk 253 days ago
        What exactly is left when we remove coding from software engineering? Could it be handled by a manager? Or perhaps by a single senior SWE who could now perform the work of an entire team using these rapidly advancing AI coders?
        [-]
        findjashua 253 days ago
        for a lot of tasks that aren't as cut & dry, i often find myself having to provide it pseudo code, which it can then one-shot to working code.
        don't get me wrong, it's still a massive upgrade from the pre-sonnet era, but i still don't think it can take a high-level requirement and convert it into a working project... yet
        [-]
        anonzzzies 253 days ago
        > but i still don't think it can take a high-level requirement and convert it into a working project.
        It cannot, you need to hand-hold it, as in, to make something larger than a (albeit good looking) to do app, you don't need to write code , but you do need to be able to review and debug code and take the architectural decisions. It'll simply loop forever otherwise.
      - gorgoiler 253 days ago
        It’s a good question. I would ask…
        (1) Sure, it can tell you how to write new code in response to a prompt about your current local problem, but
        (2) can it reason about an entire code base of known and unknown problems, and use that basis to figure out solutions to the unknowns such that you delete code and collapse complexity.
        The software equivalent of realising that if you subtract xy from this:
        x2 + 3xy + y2
        You can turn it into a much neater version:
        (x + y)2 + xy
        …but doing that with 100k tokens of code instead of a handful of algebra tokens.
      - lukasb 253 days ago
        I haven't had much luck with architecture stuff. Maybe I'm holding it wrong.
    - spaceman_2020 253 days ago
      The new version is already in Cursor and its outstanding.
      Can code at mid-level now. Almost.
- miki123211 253 days ago
  When using these models via the official Anthropic API, do I have to do anything to "opt in" to the new Sonnet, or am I switched over automatically?
  [-]
  - simonw 253 days ago
    That depends on the model ID you are using.
    If you use "claude-3-5-sonnet-latest" you'll be upgraded to "claude-3-5-sonnet-20241022" already - I tested that this morning.
    If you're on "claude-3-5-sonnet-20240620" you'll need to change that ID to either the -latest one or the -20241022 one.
- usaar333 253 days ago
  FWIW, the refactor benchmark is quite mechanical - it just stresses reliability of LLMs over long context windows:
  Questions are variants of:
  Refactor the _set_csrf_cookie method in the CsrfViewMiddleware class to be a stand alone, top level function. Name the new function _set_csrf_cookie, exactly the same name as the existing method. Update any existing self._set_csrf_cookie calls to work with the new _set_csrf_cookie function.
  [-]
  - jstummbillig 253 days ago
    Assuming that that is indeed what most of the benchmark does: If the LLMs are as bad as it as the numbers suggest, then it seems like a perfectly good benchmark. I would definitely want them to be able to do stuff like that when I let them write my code.
- faizshah 253 days ago
  Anecdotally but I still get significantly better results from ChatGPT than claude for coding.
  Claude is way less controllable it is difficult to get it to do exactly what I want. ChatGPT is way easier to control in terms of asking for specific changes.
  Not sure why that is maybe the chain of thought and instruction tuning dataset has made theirs a lot better for interactive use.
  [-]
  - anonzzzies 253 days ago
    For me it's the opposite; chatgpt (o1 preview and 4o) keep making very strange errors; errors that I even exactly tell it how to fix and it simply repeats the fundamental mistakes again. With claude, I did not have that.
    Example; I asked it to write some js that finds a button on a page, clicks the button, then waits for a new element with some selector to appear and return a ref to it; chatgpt kept returning (pseudo code);
    while (true) {
    button.click()
    wait()
    oldItems = ...
    newItems = ...
    newItem = newItems - oldItems
    if (newItem) return newItem
    sleep(1)
    }
    which obviously doesn't work. Claude understands to put the oldItems outside the while; even when I tell chatgpt to do that, it doesn't. Or it does one time and with another change, it moves it back in.
  - bigs 253 days ago
    Try as I might, ChatGPT couldn’t give me working code for a simple admin dash layout in Vue with a sidebar than can minimise. I had to correct it, it would say my apologies and provide new code with a different error. About 10 times in a row it got in a loop of errors and I gave up.
    Do any of these actually help coding?
    [-]
    - cryptoz 253 days ago
      Prompting is a skill you can develop with practice and get better at. Also, some tasks just aren’t going to work well for various reasons.
      Yes, LLMs can actually help with coding. But it’s not magic. There are limits. And you get better with practice.
      [-]
      - brigandish 253 days ago
        Without people providing their prompts, it's impossible to say whether they are skilled or not, and their complaints or claims of "it worked with this prompt" without the output are also not possible to validate.
        Maybe there's a clue in there as to why these experiences seem so different. I'm glad GPTs don't get frustrated.
        [-]
        simonw 253 days ago
        I have a personal policy of sharing my prompts as openly as possible. I've shared hundreds at this point - for a bunch of recent examples see https://simonwillison.net/2024/Oct/21/claude-artifacts/ and https://simonwillison.net/tags/ai-assisted-programming/
        cheevly 253 days ago
        Ive spent thousands of hours, literally, learning the ropes, and continue to hone it. There is a much higher skill ceiling for prompting than there was for Google-fu.
        [-]
        mraza007 253 days ago
        Back in the day googling was a skill not with the Rise of LLMs Prompting is a skill
        HPsquared 248 days ago
        Literally ropes as in RoPE, rotary positional embeddings?
    - simonw 253 days ago
      Give it one or two examples of what you want. Don't expect these things to perfectly solve every problem - they're transformation machines, so they can do pretty much anything if you figure out the right input.
    - ipaddr 253 days ago
      Just tried it and it worked. Try this:
      give me a vue js page. I want a sidebar that minimizes (if triggered). Make simple admin placeholder page.
      [-]
      - bigs 253 days ago
        This was about 6 months ago I think. I’ll happily give it another shot.
  - csomar 253 days ago
    Maybe it's relative? Claude beats GPT-4/o by a far margin for me but I am mostly using them for Rust.
    [-]
    - kqr 253 days ago
      I also think there are subtle differences in how models like to be prompted, so some people will have more luck with one type of model.
theflyestpilot 252 days ago
cries in UiPath
HanClinto 253 days ago
Why not rev the numbers? "3.5" vs. "3.5 New" feels weird -- is there a particular reason why Anthropic doesn't want to call this 3.6 (or even 3.5.1)?
[-]
- abeppu 253 days ago
  The confusing choice they seem to have made is that "Claude 3.5 Sonnet" is a name, rather than 3.5 being a version. In their view, the model "version" is now `claude-3-5-sonnet-20241022` (and was previously `claude-3-5-sonnet-20240620`).
  https://docs.anthropic.com/en/docs/about-claude/models
  [-]
  - dragonwriter 253 days ago
    OpenAI does exactly the same thing, by the way; the named models also have dated versions. For instance, there current models include (only listing versions with more than one dated version for the same "name" version):
```
  gpt-4o-2024-08-06 
  gpt-4o-2024-05-13
  gpt-4-0125-preview
  gpt-4-1106-preview
  gpt-4-0613
  gpt-4-0314
  gpt-3.5-turbo-0125
  gpt-3.5-turbo-1106
```
    [-]
    - coder543 253 days ago
      On the one hand, if OpenAI makes a bad choice, it’s still a bad choice to copy it.
      On the other hand, OpenAI has moved to a naming convention where they seem to use a name for the model: “GPT-4”, “GPT-4 Turbo”, “GPT-4o”, “GPT-4o mini”. Separately, they use date strings to represent the specific release of that named model. Whereas Anthropic had a name: “Claude Sonnet”, and what appeared to be an incrementing version number: “3”, then “3.5”, which set the expectation that this is how they were going to represent the specific versions.
      Now, Anthropic is jamming two version strings on the same product, and I consider that a bad choice. It doesn’t mean I think OpenAI’s approach is great either, but I think there are nuances that say they’re not doing exactly the same thing. I think they’re both confusing, but Anthropic had a better naming scheme, and now it is worse for no reason.
      [-]
      - dragonwriter 253 days ago
        > Now, Anthropic is jamming two version strings on the same product, and I consider that a bad choice. It doesn’t mean I think OpenAI’s approach is great either, but I think there are nuances that say they’re not doing exactly the same thing
        Anthropic has always had dated versions as well as the other components, and they are, in fact, doing exactly the same thing, except that OpenAI has a base model in each generation with no suffix before the date specifier (what I call the "Model Class" on the table below), and OpenAI is inconsistent in their date formats, see:
        Major Family Generation Model Class Date claude 3.5 sonnet 20041022 claude 3.0 opus 20240229 gpt 4 o 2024-08-06 gpt 4 o-mini 2024-07-18 gpt 4 - 0613 gpt 3.5 turbo 0125
        [-]
        coder543 253 days ago
        But did they ever have more than one release of Claude 3 Sonnet? Or any other model prior to today?
        As far as I can tell, the answer is “no”. If true, then the fact that they previously had date strings would be a purely academic footnote to what I was saying, not actually relevant or meaningful.
- nisten 253 days ago
  For a company selling intelligence, that's a pretty stupid way of labelling a new product.
  [-]
  - riffraff 253 days ago
    "computer use" is also as bad a marketing choice as possible for something that actually seems pretty cool.
    [-]
    - pglevy 253 days ago
      I had no idea what the headline meant before reading the article. I wasn't even sure how to pronounce "use." (Maybe a typo?) I think something like "Claude adds Keyboard & Mouse Control" would be clearer.
      [-]
      - barrell 253 days ago
        I read the headline 5-10 times trying to make sense of it before even clicking on the link.
        Native English speaker, just used the other “use” many times
    - accrual 253 days ago
      I'm not sure what a better term is. It's kind of understated to me. An AI that can "use a computer" is a simple straightforward sentence but with wild implications.
      [-]
    - ok_dad 253 days ago
      It’s simple and easy to understand what it is, that’s good marketing to my ears.
    - swyx 253 days ago
      it makes sense in contrast to "tool use". basically, either fly-by-vision or fly-by-instruments, same dilemma you have in self driving cars
  - dartos 253 days ago
    It worked for Nintendo.
    The 3ds and “new 3ds” were both big sellers.
    [-]
    - Zambyte 253 days ago
      3ds doesn't have a version number to bump. Claude 3.5 does.
      [-]
      - cooper_ganglia 253 days ago
        I hear the Nintendo 4DS was very popular with the higher dimensional beings!
      - dartos 253 days ago
        The 3 was the version number ;)
        Ds and ds lite were version 1
        Dsi was 2 (as there was dsi software that didn’t run on ds or ds lite)
        And the 3ds was version 3.
        [-]
        Zambyte 244 days ago
        https://www.perplexity.ai/search/what-is-the-meaning-of-the-...
        kurisufag 253 days ago
        there /was/ a 2DS, though, and it came after the 3DS.
      - r00fus 253 days ago
        You can always add a version number (e.g. 3DS2) or a changed moniker (3DS+).
  - dragonwriter 253 days ago
    Every major AI vendor seems to do it with hosted models; within "named" major versions of hosted models, there are also "dated" minor versions. OpenAI does it. Google does it (although for Google Gemini models, the dated instead of numbered minor versions seem to be only for experimental versions like gemini-1.5-pro-exp-0827, stabled minor versions get additional numbers like gemini-1.5-pro-002.)
  - quantadev 253 days ago
    Speaking of "intelligence", isn't it ironic how everyone's only two words they use to describe AI is "crazy" and "insane". Every other post on Twitter is like: This new feature is insane! This new model is crazy! People have gotten addicted to those words almost as badly as their other new addiction: the word "banger".
    [-]
    - fragmede 253 days ago
      Well yeah. This new model is mentally unwell! and This model is a total sociopath! didn't test as well in focus groups.
- HarHarVeryFunny 253 days ago
  Well, by calling it 3.5, they are telling you that this is NOT the next-gen 4.0 that they presumably have in the works, and also not downplaying it by just calling it 3.6 (and anyways they are not advancing versions by 0.1 increments - it seems 3.5 was just meant to convey "half way from 3.0 to 4.0"). Maybe the architecture is unchanged, and this just reflects more pre and/or post-training?
  Also, they still haven't released 3.5 Opus yet, but perhaps 3.5 Haiku is a distillation of that, indicating that it is close.
  From a competitive POV, it makes sense that they respond to OpenAI's 4o and o1 without bumping the version to Claude 4.0, which presumably is what they will call their competitor to GPT-5, and probably not release until GPT-5 is out.
  I'm a fan of Anthropic, and not of OpenAI, and I like the versioning and competitive comparisons. Sonnet 3.5 still best coder, better than o1, has to hurt, and a highly performant cheap Haiku 3.5 will hit OpenAI in the wallet.
- therealmarv 253 days ago
  exactly my thought too, go up with the version number! Some negative examples: Claude Sonnet 3.5 for Workstations, Claude Sonnet 3.5 XP, Claude Sonnet 3.5 Max Pro, Claude Sonnet 3.5 Elite, Claude Sonnet 3.5 Ultra
  [-]
  - xnx 253 days ago
    Claude Sonnet 3.5 360, Claude Sonnet 3.5 One
  - scrlk 253 days ago
    2007: "Choose a Vista" - https://www.youtube.com/watch?v=5-feCRQBkSs
    2024: "Choose a Claude"?
  - r00fus 253 days ago
    Super Claude Sonnet 3.5 Champion Edition, Alpha 3
- oezi 253 days ago
  Let's just say that the LLM companies still are learning how to do versioning in a customer friendly way.
- GaggiX 253 days ago
  Similar to OpenAI when they update their current models they just update the date, for example this new Claude 3.5 Sonnet is "claude-3-5-sonnet-20241022".
- afro88 253 days ago
  Just guessing here, but I think the name "sonnet" is the architecture, the number is the training structure / method, and the model date (not shown) is the data? So presumably with just better data they improved things significantly? Again, just a guess.
- KaoruAoiShiho 253 days ago
  My guess is they didn't actually change the model, that's what the version number no change is conveying. They did some engineering around it to make it respond better, perhaps more resources or different prompts. Same cutoff date too.
- m3kw9 253 days ago
  Maybe they notice 3.5 Sonnet has become a brand and pivot it away from a version
  [-]
  - sureIy 253 days ago
    Is it OS X all over again?
- pella 253 days ago
  claude-3-5-sonnet-20241022
  [-]
  - moffkalast 253 days ago
    claude-3-5-sonnet-20241022-final-final-2
- bloedsinnig 253 days ago
  Because its a finetune of 3.5 optimized for the use case of computer use.
  Its actually accurate and its not a 3.6.
  [-]
  - therealmarv 253 days ago
    So 3.5.1 ?
    [-]
    - dotancohen 253 days ago
      I think that was the last version number for KDE 3.
      Stands out for me as I once replaced a 2.3 Turbo in a TurboCoupe with a 351 Windsor ))
    - r00fus 253 days ago
      For networks
  - usaar333 253 days ago
    I don't think that's correct. This looks like a new model. Significant jump in math and gpqa scores.
    [-]
    - diggan 253 days ago
      If the architecture is the same, and the training scripts/data is the same, but the training yielded slightly different weights (but still same model architecture), is it a new model or just a iteration on the same model?
      What if it isn't even a re-training from scratch but a fine-tune of an existing model/weights release, is it a new version then? Would be more like a iteration, or even a fork I suppose.
      [-]
      - cooper_ganglia 253 days ago
        Yes, it's a new model, but not a Claude 4.
        It's the same, but a bit different; Claude 3.6 makes sense to me.
    - bloedsinnig 252 days ago
      I would assume that 3.5 means that the base training (which takes weeks/month) wasn't changed but only finetuning happened.
    - HarHarVeryFunny 253 days ago
      Could be just additional post-training (aka finetuning) for coding/etc.
jampekka 253 days ago
It's quite sad that application interoperability requires parsing bitmaps instead of exchanging structured information. Feels like a devastating failure in how we do computing.
[-]
- abrichr 253 days ago
  See https://github.com/OpenAdaptAI/OpenAdapt for an open source alternative that includes operating system accessibility API data and DOM information (along with bitmaps) where available.
  We are also planning on extracting runtime information using COM/AppleScript: https://github.com/OpenAdaptAI/OpenAdapt/issues/873
  [-]
  - accrual 253 days ago
    It's super cool to see something like this already exists! I wonder if one day something adjacent will become a standard part of major desktop OSs, like a dedicated "AI API" to allow models to connect to the OS, browse the windows and available actions, issue commands, etc. and remove the bitmap parsing altogether as this appears to do.
- HarHarVeryFunny 253 days ago
  It's really more of a reflection on where we're at in the timeline of computing, with humans having been the major user of apps and webs site up until now. Obviously we've had screen scraping and terminal emulation access to legacy apps for a while, and this is a continuation of that.
  There have been, and continue to be, computer-centric ways to communicate with applications though, such as Windows COM/OLE, WinRT and Linux D-Bus, etc. Still, emulating human interaction does provide a fairly universal capability.
- chillee 253 days ago
  It's very much in the "worse is better" camp.
- janalsncm 253 days ago
  Apps are built for people rather than computers.
- smartician 253 days ago
  If the goal is to emulate human behavior, I'd say there is a case to be made to build for the same interface, and not rely on separate APIs that may or may not reflect the same information as a user sees.
- caeril 253 days ago
  You can blame normies for this. They love their ridiculous point and click (and tap) interfaces.
  Fortunately, with function-calling (and recently, with guaranteed data structure), we've had access to application interoperability with LLMs for a while now.
  Don't get mad at a company for developing for the masses - that's what they are expected to do.
  But they built for us, first.
- rfoo 253 days ago
  It's quite sad that application interoperability requires parsing text passed via pipes instead of exchanging structured information.
  Like others said, worse is better.
  [-]
  - jlpom 252 days ago
    Your comment takes 630 bits, the screenshot of your comment on my computer takes 2.1 MB, about 218k times the size. Either this is a compute overhead the LLM has to do before it can think about the meaning of the text, or if it's a E2E feedforward architecture, less thinking about it. This is simple for us because neurons in the retina pre-process its stream so that less than 0.8 % is sent to the visual cortex and because we have evolved to very efficiently and quickly extract meaning from our vision. This is a prime example of the Moravec's paradox.
- SuaveSteve 253 days ago
  The people have chosen apps over protocols.
  [-]
  - jampekka 253 days ago
    Worse is better.
mergisi 252 days ago
My First Experience with Claude Computer Use - It's Mind-Blowing!
Just tested Claude's new Computer Use feature and had to share this simple but powerful test:
My Basic Prompt: "Please: 1. Search Amazon for 3 wireless earbuds: Find price Rating Brand name
2. Make a simple Excel file 'earbuds.xlsx': Put the information in a basic table Add colors to the headers Sort by price
3. Show me the results"
What blew my mind: - Claude actually looked at my screen - Moved the mouse by itself - Clicked buttons like a human - Created reports automatically
It's like having a virtual assistant that can really use your computer! No coding needed - just simple English instructions.
For those interested: https://mergisi.medium.com/8f56f683e307
[-]
- mirkodrummer 252 days ago
  This comment seems generated, or having a good marketing copy
- AmigoCharlie 252 days ago
  ... it's almost as Claude Computer Use could (and will) just replace you entirely, huh?
GavinGruesome 253 days ago
[dead]
baq 253 days ago
Scary stuff.
'Hey Claude 3.5 New, pretend I'm a CEO of a big company and need to lay off 20% people, make me a spreadsheet and send it to HR. Oh make sure to not fire the HR department'
c.f. IBM 1979.
freediver 253 days ago
Both new Sonnet and gpt-4o still fail at a simple:
"How many w's are in strawberry?"
gpt-4o: There are 2 "w's" in "strawberry."
Claude 3.5 Sonnet (new): Let me count the w's in "strawberry": 0 w's.
(same question with 'r' succeeds)
What is artificial about current gen of "artificial intelligence" is the way training (predict next token) and benchmarking (overfitting) is done. Perhaps a fresh approach is needed to achieve a true next step.
[-]
- wild_egg 253 days ago
  It's bad at directly working on classical computer problems like math and data processing. But you can do it indirectly by having it write a program that produces the correct result. Interestingly, I didn't even have to have it run the program, although usually you would
```
    write a tool which counts the number of w's in "strawberry" and return the result
```
  Which produced:
```
    Here's a simple Python function that counts the number of 'w's in the word "strawberry" and returns the result:

    ```python
      def count_w_in_strawberry():
          word = "strawberry"
          count = word.count('w')
          return count
  
      # Call the function and print the result
      result = count_w_in_strawberry()
      print(f"The number of 'w's in 'strawberry' is: {result}")
    ```

    This tool does the following:

    1. We define a function called `count_w_in_strawberry()`.
    2. Inside the function, we assign the word "strawberry" to a variable called `word`.
    3. We use the `count()` method on the `word` string to count the occurrences of 'w'.
    4. The function returns the count.
    5. Outside the function, we call `count_w_in_strawberry()` and store the result in the `result` variable.
    6. Finally, we print the result.

    When you run this code, it will output:

    ```
      The number of 'w's in 'strawberry' is: 1
    ```

    This tool correctly identifies that there is one 'w' in the word "strawberry".
```
  [-]
  - jandrese 253 days ago
    I always thought the halting problem was an academic exercise, but here we see a potential practical use case. Actually this seems pretty dangerous letting the LLM write and automatically execute code. How good is the sandbox? Can I trick the LLM into writing a reverse shell and opening it up for me?
    [-]
    - okasaki 252 days ago
      I'm not sure it's still the case, but I've had ChatGPT run shell commands. But I don't know what you could do since it's ephemeral, doesn't have internet access or root. Plus I'm sure they have security scanning.
- redox99 253 days ago
  There's always that one tokenization error comment
- int_19h 253 days ago
  There are models around that will answer this question perfectly.
  But why would anyone care? If you want to count w's in a word, that's a Python one-liner, you don't need an LLM for that (although the LLM can write it for you).
- fassssst 253 days ago
  They are trained on tokens not characters.
- ssijak 253 days ago
  Can we stop with these useless strawberry examples?