Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

(senior-swe-bench.snorkel.ai)

46 points | by matt_d 2 hours ago

12 comments

  • magnio 34 minutes ago
    I saw on Twitter that in an ML course at Tsinghua University, one of the tests asks students to write quizzes that fail as many LLMs as possible.

    What if we create a benchmark that works like this and assigns Elo ratings? Models fight head-to-head: one writes a question, a bug, or an incomplete implementation, which the opponent has to answer, fix, or finish.
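
    A minimal sketch of how the rating bookkeeping for such head-to-head rounds might work, assuming the standard Elo update. The K-factor of 32, the starting rating of 1000, the model names, and the round outcomes are all hypothetical and chosen only for illustration; none of this comes from Senior SWE-Bench itself.

```python
from collections import defaultdict

K = 32          # assumed K-factor (hypothetical choice)
START = 1000.0  # assumed starting rating (hypothetical choice)


def expected_score(r_a, r_b):
    """Standard Elo expected score of a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(ratings, setter, solver, solved):
    """Apply one round: the solver wins if it solved the setter's task,
    otherwise the setter wins. Ratings are modified in place."""
    e_solver = expected_score(ratings[solver], ratings[setter])
    s_solver = 1.0 if solved else 0.0
    delta = K * (s_solver - e_solver)
    ratings[solver] += delta
    ratings[setter] -= delta  # zero-sum: the setter loses what the solver gains
    return ratings


# Hypothetical rounds: (task setter, task solver, did the solver succeed?)
rounds = [
    ("model_a", "model_b", False),
    ("model_b", "model_a", True),
    ("model_a", "model_c", True),
]

ratings = defaultdict(lambda: START)
for setter, solver, solved in rounds:
    update(ratings, setter, solver, solved)

for name in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{name}: {ratings[name]:.1f}")
```

    Each model would play both roles, so a high rating rewards writing hard tasks as well as solving them. A real setup would also need a check that each task is actually solvable, which this sketch leaves out.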

  • 0xbadcafebee 34 minutes ago
    The "tasteful solves" is codified cargo culting. The software industry has a tendency to anthropomorphize software while playing to the ego of the programmer. The programmer imagines they are creating a "beautiful" artistic expression. Good code becomes "tasteful", as a software artist must have "good taste" to tell the good software from the bad software. Good quality lacks "bad smells", because a good artist has fine senses (and everybody must like the same smells). "Fine craftsmanship", in code as in woodworking, means your finely-crafted work is "technically superior", so you can charge more money for something that could've been made cheaper and faster and done the same thing.

    But it's a lie. Nobody's paying you to make paintings. They're paying you to build machines. Equating "making working software" with "taste" always devolves into bikeshedding and subjective opinionism, uses subjective human feelings to describe what should be objective and functional, isn't rooted in scientific rigor, and detracts from the real purpose of the thing. The work doesn't actually get better by trying to apply artistic principles to engineering. It just feels better for the people making it.

    Once you make the machine work, then you can go about gilding the lily. But this is unromantic, unsatisfying, boring. Since the inmates run this particular asylum, we end up with a benchmark that tries to accurately mimic the human ego as applied to software design. Thus the new Gods create their digital Adams and Eves in their image.

    • Dban1 5 minutes ago
      As time passes we will have fewer and fewer literati
  • facorreia 27 minutes ago
    It's nice to see a new public benchmark from Snorkel. They're doing some pretty sophisticated stuff over there.
  • _345 23 minutes ago
    This explains why I've always felt that Opus 4.8 was leagues ahead of GPT 5.5. It's so good at taking underspecified requirements and filling in the gaps with sensible approaches for your project.
    • re-thc 6 minutes ago
      > It's so good at taking underspecified requirements and filling in the gaps with sensible approaches for your project.

      At a high level. It misses low-level or other non-functional requirements in its own ways, so I wouldn't say Opus is strictly better.

      It's also possible that it's more of a harness problem than a model problem.

  • monster_truck 19 minutes ago
    Once again I am asking: who are these people and what makes them more qualified than any of you to assess anyone or anything "as a senior engineer" (with the subtext being that none of you are, either)?
    • re-thc 5 minutes ago
      > who are these people and what makes them more qualified than any of you

      Anyone can run something and make a web page. These people just do it instead of questioning. Main difference. If everyone asks "how could you" "are you qualified" then we have nothing but gatekeeping.

  • jonathanleane 2 hours ago
    Top solve rate is currently 24% with Opus 4.8... What's a competent human supposed to score?
    • lacunary 2 hours ago
      Presumably whatever the top model scores and then some, since the human can use the model.

      I wonder if a model could score higher if it had a human at its disposal?

      • pishpash 29 minutes ago
        Maybe models should ask for human-in-the-loop input, as a matter of convention.
  • LiamPowell 1 hour ago
    > You are a senior SWE-Bench reviewer, make no mistakes.

    I don't know what a better approach would look like while still remaining feasible, but this approach of telling an LLM to make a subjective judgement seems fundamentally flawed.

  • guilhermecgs 53 minutes ago
    fable 5?
    • guessmyname 36 minutes ago
      The people who created the benchmark(s) don’t have access to Fable 5.
  • Madmallard 1 hour ago
    next round of trust me bro benchmarks
    • dozerly 44 minutes ago
      Just wait for the next 100 rounds. People love seeing the 65% -> 85% seemingly over and over again for every new model.
  • danpalmer 2 hours ago
    Why didn't they just make it "Staff SWE-Bench", would be much better smh. /s

    But seriously, as an industry we're terrible at assessing engineering levels. I've worked with "senior engineers" who can't code and I've worked with "junior engineers" who could run rings around them.

    Benchmarks like this should be much more precise about what they're actually testing, and what axes they're hard on. We also need to rise above prompts like "you are a senior engineer", it's woo, and it's far better to ask for precise outcomes.

    • glaslong 56 minutes ago
      Principal-SWE-Bench will take some time to run, because the LLM needs to wait for a crisis to present its solution, having correctly identified that the same solution would have been organizationally impossible to propose until that moment.
    • amrrs 1 hour ago
      As someone who's trying to get better assessments, I'm struggling to come up with objective coding tasks that evaluate all aspects of real-life work, like planning, design choices, problem solving, and context usage. From your experience with humans, do you have any recommendations on what could be effective in measuring them?
      • allan_s 1 hour ago
        I think the source of your issue is in your statement itself: why do you want a task that evaluates things that broad to be only a coding task? Shouldn't it be a planning task, documentation task, knowledge retrieval task, etc.? And almost certainly not with just an initial prompt, but with an existing codebase + existing docs + tickets?
  • jocelyner 1 hour ago
    [flagged]
  • purple-leafy 2 hours ago
    Benchmarks are great, but I feel like there’s a better way. This seems quite subjective.

    What you really need is an objective benchmark.

    • eli 1 hour ago
      I actually really like subjective benchmarks, so long as it's a human (ideally me) grading the results. LLM as judge never made much sense.
      • charcircuit 48 minutes ago
        The issue is that you can't do unsupervised learning if you require humans.
    • echelon 1 hour ago
      > What you really need is an objective benchmark

      "When are all the software engineers unemployed?"