8 comments

  • nayroclade 17 hours ago
    I suspect AGENTS.md files will prove to be a short-lived relic of an era when we had to treat coding agents like junior devs, who often need explicit instructions and guardrails about testing, architecture, repo structure, etc. But once agents have judgement equivalent to (or better than) a senior engineer's, they can make their own calls about these aspects, and trying to "program" their behaviour via an AGENTS.md file becomes as unhelpful as one engineer trying to micro-manage another's approach to solving a problem.
    • nextaccountic 12 hours ago
      It's a best practice to document things about the code base, so that other devs (even senior devs) don't start to do things differently. This will probably not change

      What I think is short-lived is this insistence on separating LLM instructions from general documentation for both humans and AI. LLMs can read human docs, and concerns about context window size will probably disappear

      Maybe future docs will be LLM-first, though: people won't read them directly, they'll ask an LLM questions about them

    • jbergqvist 13 hours ago
      I think AGENTS.md will still have a place regardless. There are conventions, design philosophies, and project-specific constraints that can't be inferred from code alone, no matter how good the judgment
    • sdenton4 17 hours ago
      Eh, even for a senior engineer, dropping into a new codebase is greatly helped by an orientation from someone who works on the code. What's where, common gotchas, which tests really matter, and so on. The agents file serves a similar role.
      • fennecbutt 14 hours ago
        Yup, readmes exist for a reason even for meat bags
        • derefr 9 hours ago
          Except that most READMEs are seemingly written more for end-users than for developers; and even CONTRIBUTING files often mostly just document the social contribution process + guidelines rather than providing any guidance targeted toward those who would contribute. There’s a lot of “top-level architectural assumptions” detail in particular that is left on the floor, documented nowhere. Which “works” when you expect human devs to “stare really hard and ask questions” until they figure out what’s being done differently in this codebase; but doesn’t work at all when an LLM with zero permanent learning capability gets involved.
  • lmeyerov 17 hours ago
    I liked that they did this work and its sister paper, but disliked how it was positioned as basically the opposite of the truth.

    The good: It shows that on one kind of benchmark, some flavors of agentically-generated docs don't help. So naively generating these, for that kind of task, doesn't work. Thank you, useful to know!

    The bad: Some people assume this means these don't work in general, or that automation can't generate useful ones.

    The truth: Instruction files help measurably, and a bit of engineering lets you reliably hit high scores for the typical cases. As soon as you have an objective function, you can flip it into an eval and set an AI coder to editing these files until they pass.

    Ex: We recently released https://github.com/graphistry/graphistry-skills for more easily using graphistry via AI coding, and by having our authoring AI loop a bit with our evals, we jumped the scores from a 30-50% success rate to 90%+. As we encounter more scenarios (and mine them from our chats etc), it's pretty straightforward to flip them into evals and ask Claude/Codex to loop until those work well too.

    We do these kinds of eval-driven AI coding loops all the time, and IMO how to engineer these should be the message, not that they don't work on average. Deeper example near the middle/end of the talk here: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t...
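The loop described above can be sketched roughly as follows. This is a minimal, hypothetical skeleton: `revise` (ask an AI coder to edit the instruction file) and `score` (run the eval suite and return a pass rate) are stand-ins for real tooling, not actual APIs.

```python
def improve_instructions(instructions, eval_cases, revise, score,
                         max_rounds=5, target=0.9):
    """Iteratively revise an instruction file until its eval pass rate
    meets the target, keeping the best candidate seen so far."""
    best = instructions
    best_score = score(best, eval_cases)
    for _ in range(max_rounds):
        if best_score >= target:
            break
        # e.g. ask Claude/Codex to edit the file given failing eval cases
        candidate = revise(best, eval_cases)
        s = score(candidate, eval_cases)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score
```

The key design choice is that the objective function (the evals) drives the edits, so the loop only ever keeps revisions that measurably improve the pass rate.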

  • noemit 18 hours ago
    The research mostly points to LLM-generated context lowering performance. Human-generated context improves performance, but any kind of AGENTS.md file increases token use on what they call "fake thinking." More research is needed.
    • d1sxeyes 18 hours ago
      Agree. Also, sometimes I intentionally want the agent to do something differently to how it would naturally solve the problem. For example, there might be a specific design decision that the agent should adhere to. Obviously, this will lead to slower task completion, higher inference costs etc. because I’m asking the agent not to take the path of least resistance.

      This kind of benchmark completely misses that nuance.

    • stingraycharles 18 hours ago
      I’d say that it needs to be maintained and reviewed by a human, but it’s perfectly fine to let an LLM generate it.
      • sheept 17 hours ago
        If you let an LLM generate it (e.g. Claude's /init), it'll be a lot more verbose than it needs to be, which wastes tokens and de-emphasizes any project-specific preferences you actually want the agent to heed.
  • CrzyLngPwd 17 hours ago
    I have a legacy codebase of around 300k lines spread across 1.5k files, and have had amazing success with the agents.md file.

    It just prevents hallucinations and coerces the AI to use existing files and APIs instead of inventing them. It also has gold-standard tests and APIs as examples.

    Before the agents file, it was just chaos of hallucinations and having to correct it all the time with the same things.

    • OutOfHere 17 hours ago
      You might have better luck with more focused task-specific instructions if you can be bothered to write them.
      • CrzyLngPwd 16 hours ago
        I do write them, but without the agents file, I would end up pasting the same contents each time I start work on something, which seems like a waste of time.
        • derefr 9 hours ago
          No? What you’re looking for here are project-specific agent skill files. These don’t need to be “pointed at” by AGENTS.md for the agent to find them and use them when/where applicable; nor do they take up undue context when not in use. (I believe the agentic coding frameworks internally index skills for retrieval, then search that index for applicable skills to pull temporarily into context at the beginning of each planning step.)
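A skill file along these lines might look like the sketch below. The path, frontmatter fields, and commands are illustrative, not from any real repo; the general shape follows the common SKILL.md convention.

```
.claude/skills/run-legacy-tests/SKILL.md

---
name: run-legacy-tests
description: How to run and interpret the legacy test suite for this repo
---

Run `make test-legacy` from the repo root. Ignore deprecation warnings
from the old ORM; a run passes if the final summary line reports zero
failures.
```

Because only the `name` and `description` are indexed up front, the body costs no context until the agent actually decides the skill is relevant.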
        • OutOfHere 13 hours ago
          What that tells me is that you don't narrow the focus of the LLM to the task at hand. Instead of giving it a hundred pages of information for every task, if you give it a single page that is both relevant and required, your result could be tighter.
  • dev_l1x_be 17 hours ago
    I never use these files; instead I give the guardrails for a specific task to each short agent run. Having task-specific "agents.md" files works better for me.
  • stingraycharles 18 hours ago
    What is going on in this thread and why are all comments downvoted so heavily?
  • verdverm 18 hours ago
    That research has been so misinterpreted for headlines and clicks...

    AGENTS.md are extremely helpful if done well.

    • lucketone 18 hours ago
      Everybody thinks they do agents.md well
      • verdverm 10 hours ago
        I do them minimally, but that's beside the point. The research is about machine-generated AGENTS.md files; it is insufficient to support broad generalizations.

        Consider how many papers there are showing both productivity increases and decreases. The jury is still out there, and here too. Anecdotally and intuitively, AGENTS.md makes a positive difference. One example is providing the build/test commands: if you don't, how should the agent determine how to validate its changes?
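A minimal AGENTS.md along those lines might look like this. The commands and paths are illustrative, not from any particular project:

```markdown
# AGENTS.md

## Build and test
- Build: `make build`
- Run the full test suite: `make test`
- Lint before committing: `make lint`

## Conventions
- New modules go under `src/`, mirrored by tests under `tests/`.
- Prefer extending existing APIs over inventing new ones.
```

Even a short file like this answers the validation question directly: the agent knows exactly which commands prove its changes work.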