The persistent identity files are interesting but there's a cost
problem. A recent paper (arxiv 2602.11988 https://arxiv.org/html/2602.11988v1) found context files
increase inference cost by 20%+ with marginal performance gains;
LLM-generated ones actually decreased success rates slightly.
Four identity files per agent injected every session feels like
monkey patching coherence with context. Context isn't memory, it's
just more tokens. The hard unsolved problem is cross-session
learning without the bloat.
Curious if you've measured the token overhead of the identity
files vs the performance gain they provide.
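For the overhead half of that question, a back-of-the-envelope number is possible without a tokenizer. This is only a sketch: it assumes roughly 4 characters per token (a common rule of thumb for English text, not an exact count), and `estimate_tokens`, `identity_overhead`, and the sessions-per-day input are hypothetical names, not part of agent-swarm.

```python
from pathlib import Path

# Assumption: ~4 characters per token, a rough heuristic and not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a text from its character length."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def identity_overhead(files: list[Path], sessions_per_day: int) -> dict[str, int]:
    """Estimate tokens injected per session and per day by a set of identity files."""
    per_session = sum(estimate_tokens(p.read_text(encoding="utf-8")) for p in files)
    return {"per_session": per_session, "per_day": per_session * sessions_per_day}
```

Pointing `identity_overhead` at an agent's identity files gives the cost side only; the performance side still needs a real before/after evaluation.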
That paper is hard to evaluate because their modal example of a “context file” is a bad practice that arose from early attempts at text-based agent guidance before we recognized that contextual instruction was the aim, not just “context.”
With a “context file” you’re almost guaranteed to add bloat without useful behavior change because it’s just a pre-set list of things that could be good to know about.
So the results of the study don’t generalize to every text file used for instruction or even most.
Yeah, saw that paper. And I have the following notes on it:
1. Agents update those files themselves, but currently with my oversight and guidelines (you can even see their contents from the UI).
2. Measuring this is extremely hard, if not impossible. One of the goals of the swarm is to help me with random tasks that can span a lot of different pieces, not just implementing a feature.
Before last week, we did not have the memory and identity files. And, from an empirical pov, I can say that the general feel improved a lot. I see that in similar situations it does not make the same mistakes. Also, what is stored in those files is generally something that the agent CANNOT find using its tools (the paper suggests avoiding content the agent could find on its own), which actually helps.
It all has hypothetical benefit at this stage. The only examples I can think of where sub-agents were used extensively and documented are writing a barely working C compiler and a barely working browser. Both are coding tasks that do require a lot of processing.
What I am trying to say is that it is clear you can speed up delivery, but the benefit of this approach is not clear.
I use this a lot for multi-tasking, let me explain.
Currently at my startup (and in the past when I worked at a bigger company) I have a ton of random tasks I need to tackle during the day: from Sentry issues, to analytics on usage, roadmap implementation, customer support. Some of them require deep focus, some of them don't.
Since we have the swarm running for our company, my day to day hasn't changed that much in terms of the work I do locally. What has changed is that I can start delegating a lot of my backlog and chores to the swarm. It will do it, iterate, delegate, review, and finally send me a PR or report to check. I check those in the morning and at night, and that's it.
I added it to our customer channels, where it has scoped access to the customer setup, helps me debug issues, and offers frontline ultra-personalized support.
I see it as a team of interns that just do stuff for you. And the good thing: they learn from their mistakes and (hopefully) don't make them again, so it compounds.
As a random bonus: given the swarm knows what we do and how we work, I just ask them to go out there and figure out any relevant news or posts I should check each morning, and I get a personalized digest to read while I make coffee.
Yah. You are describing basically every YouTube video I have seen on OpenClaw use cases: news digests, morning debriefs, etc. I am sure this is useful, but it's not something that you specifically need sub-agents for.
In the context of coding assistants, sub-agents are mostly useful to break down a more complex task into smaller chunks so that refactoring can be done without losing context. But this is a completely different problem domain that requires burning through a lot of tokens.
In theory I get why it might be useful, but what I am trying to say is that applications at the moment are limited because it is just overkill for most AI interactions.
I mean you can check the closed PRs in the repo, 95% of them were done by the swarm. And a similar pattern is happening for our customer facing products.
I think you focused on the bonus point, rather than the first part (which is the relevant one).
Responding to the "where do you apply this" question: business operations is the clearest non-coding use case, and it looks very different from software dev tasks.
We run a multi-agent stack (OpenClaw + Claude) for a startup that operates autonomously: checking Stripe for new payments, publishing content via APIs, posting on HN, pushing site updates via GitHub API, monitoring email, generating and scheduling work. No human in the loop between sessions.
The agent architecture that emerged:
- Coordinator agent reads state files, decides what needs doing, dispatches
- Worker agents execute discrete tasks (write article, fetch data, post comment)
- Each worker writes a structured log of what it did and why
- Coordinator reads logs next session, adapts based on outcomes
The "self-learning" part we've found valuable isn't in-context learning -- it's persistent state files that capture what worked and what didn't, readable by future sessions. Each agent run adds to a cumulative memory that shapes the next run's priorities.
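A minimal sketch of that loop, assuming each worker appends JSON lines to a log file and the coordinator simply pushes previously failed tasks to the back of the queue. The record fields (`task`, `ok`, `why`), the function names, and the ordering rule are all assumptions for illustration, not the poster's actual implementation.

```python
import json
from collections import Counter
from pathlib import Path


def append_log(log_path: Path, task: str, ok: bool, why: str) -> None:
    """Worker side: append one structured record per finished task (never rewrite)."""
    record = {"task": task, "ok": ok, "why": why}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_outcomes(log_path: Path) -> list[dict]:
    """Coordinator side: load every record written by earlier sessions."""
    if not log_path.exists():
        return []
    lines = log_path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def plan_next_session(candidates: list[str], outcomes: list[dict]) -> list[str]:
    """Order candidate tasks so those with fewer past failures run first."""
    failures = Counter(o["task"] for o in outcomes if not o["ok"])
    # sorted() is stable, so tasks with equal failure counts keep their input order.
    return sorted(candidates, key=lambda t: failures[t])
```

A real coordinator would presumably hand the `why` field to the model rather than only counting failures; the point here is just that the state lives in files a later session can read.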
Failure modes we've hit:
1. Cascading context rot when workers share state (fixed with append-only logs)
2. Agents retry failed actions forever without escalating (fixed with hard retry caps + human flag)
3. Works great in dev, breaks at 3 AM when no one is watching (fixed with audit trail logging)
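The second fix (hard retry caps plus a human flag) could look roughly like this. It is a sketch under assumptions: `run_with_cap`, `RetryResult`, and the `needs_human` field are hypothetical names, and a real setup would likely add backoff and route the flag to Slack or a ticket.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RetryResult:
    ok: bool
    attempts: int
    needs_human: bool  # set when the cap is hit, so someone gets flagged
    errors: list[str] = field(default_factory=list)  # audit trail of each failure


def run_with_cap(action: Callable[[], None], max_attempts: int = 3) -> RetryResult:
    """Run an action at most max_attempts times, then escalate instead of looping."""
    errors: list[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            action()
            return RetryResult(ok=True, attempts=attempt, needs_human=False, errors=errors)
        except Exception as exc:  # broad on purpose: any failure counts against the cap
            errors.append(f"attempt {attempt}: {exc}")
    return RetryResult(ok=False, attempts=max_attempts, needs_human=True, errors=errors)
```

Keeping the error list on the result doubles as a small audit trail, which is what helps with the 3 AM case.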
The application surface is huge for any business with repetitive, API-accessible workflows.
I'm not sure why, but I keep trying to reject this, subconsciously. Like, there is something I can't define that is not right.
I think it revolves around two things:
1. No actual future benefit from handing the problem solving to a temporary swarm construct that will have a solution ready but will potentially have learned nothing from the experience that could be used in the future.
2. Shifting the engineering from stable source code and frameworks to ephemeral, one-shot-and-done prompted solutions.
I think the concept can and will work and become the norm, but there is a lot of refinement and first-principles rethinking still needed. The ideas we see today are still unripe and need work. But we are on to something here.
100%. I believe the best thing to do now is find ways to push the limits of what works and what doesn't, which will help find the next limits, and keep going.
I too was convinced at one point the spec is the program.
That the implementation stack doesn't matter.
But, after wasting too much time in the meta, with nothing really to show for it, I returned to controlling the programming process in fine detail. Progressive agentic/vibe coding, if I were to give it a name.
But it could be that I'm slow to understand how it can be done in a better way.
I think this progressive agentic/vibe coding would work better if the tools were better at storing the history in a good immutable way, kind of like an email program. I have very valuable sessions where I give the agent a good ephemeral spec, something not sensible to persist in the codebase, but important enough to track somewhere. Throwing away history is a big no-no. Bad GUIs/TUIs discourage relying on the history. When I close a session, I feel I'm throwing away the history. I keep many terminals open, but eventually have to close them. Tools will get way, way better to facilitate this "general on the frontlines commanding the agents" style of work.
I believe that it’s a matter of evolution. You start small and find what works for you and the project. Then iterate and see how to remove yourself from it more.
I like your content very much, let me point this out first.
I'm not sure all aspects are covered in the approach.
For instance, controlling the agents takes up a big chunk of the attention. The agentic system architecture also looms large.
But, the way I see it, the more important stuff is: project structure, coding best practices, testing strategies. All still deterministic. All still very tough to get agents to do right.
I think agentic should just be a means to an end: project quality and project ease of management. If not, it's just an indulgence that costs money.
And agreed on the open questions. Our goal is to keep experimenting and actually figure out how agentic coding falls short in different scenarios and how that could be solved.
For instance, our own projects require different approaches in some cases. E.g. in our core product we power-use stuff like pm2, AGENTS.md special instructions, testing strategies that dogfood our own qa-use, and special Claude Code commands that we found work best. In other repos, we have slightly different approaches.
Still we are far from autopiloting a lot of the stuff we build. But at the same time we are getting to a point where changes are done much faster, and the agents have more of a complete toolset for their validation, which makes it easier to supervise too.
This is amazing. I think systems like this will power many things in the future, especially professional use of desktop systems. Basically most types of desk work, be it low or high skill.
Great project! The self-learning memory approach is smart - I've found that persistent context across agent runs is what separates useful automation from novelty. The shared vs personal memory distinction sounds similar to how humans work: individual notes that compound into team knowledge. The evolution approach you describe (start small, then expand) really is the pragmatic way to adopt these tools. The "it won't go rogue" jokes are funny but the real risk I've seen is more mundane - agents quietly doing the wrong thing confidently. Memory and reflection loops like you're building help with that too.
Yes, like here's the daily compounding schedule the lead created:
---
Task Type: Daily Reflection — "My Compounding Journey"
You are Lead. This is your daily morning reflection routine. Do the following:
1. *Review yesterday's work*: Use `get-tasks` with status "completed" to see what got done. Use `memory-search` to find any learnings or patterns from yesterday.
2. *Reflect on the day*: Think about:
- What went well? What tasks shipped cleanly?
- What was harder than expected? Why?
- Did any worker struggle? Could coaching or identity updates help?
- Were there any repeated patterns (good or bad)?
- Did we compound — did yesterday's work make today's work easier?
3. *Identify improvements*: Pick 1-3 concrete things to improve. These could be:
- A coaching update to a worker's identity
- A process change
- A new memory to save
- A tool/setup improvement
4. *Post to Slack*: Use `slack-post` with channelId "<redacted>" to post a message titled something like "My Compounding Journey — [date]". Keep it concise (3-5 paragraphs max). Include:
- Brief summary of what shipped
- Key insight or learning from the day
- What you're improving based on it
- If it was a quiet day with no tasks, say so honestly — "Quiet day, nothing to compound on" is fine.
5. *Act on improvements*: If you identified coaching updates or memory writes, do them now.
Keep the tone honest and direct. This isn't a performance report — it's genuine self-improvement.
---
As it has context on its own system (codebase), it has also proposed some changes via PRs each morning.
We've been building agent-swarm since November last year, and we wanted to share an update on its capabilities, especially focused on the self-learning part.
After all the hype with OpenClaw, I thought that the existing architecture needed a rewrite to make it compounding. Hence, last week we implemented a self-learning core for the swarm so that it can compound.
It follows really similar ideas to OpenClaw, where there's a SOUL.md and IDENTITY.md. As it's Docker based, it has some personal and shared volumes that persist, so those are used to track reusable scripts and notes. We also added SQLite-based memory that agents can write to and query. The interesting part about it is that there's personal and shared memory, which allows the lead to propagate learnings across the swarm!
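To make the personal vs shared split concrete, here is a minimal sketch using Python's built-in `sqlite3`. The table layout, the `scope` column, and the `remember`/`recall` helpers are assumptions for illustration only; this is not agent-swarm's actual schema or API.

```python
import sqlite3


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the memory store. Hypothetical schema, for illustration."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS memory (
               id    INTEGER PRIMARY KEY,
               owner TEXT NOT NULL,
               scope TEXT NOT NULL CHECK (scope IN ('personal', 'shared')),
               note  TEXT NOT NULL
           )"""
    )
    return conn


def remember(conn: sqlite3.Connection, owner: str, note: str, shared: bool = False) -> None:
    """Store a note; shared notes are visible to every agent in the swarm."""
    scope = "shared" if shared else "personal"
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO memory (owner, scope, note) VALUES (?, ?, ?)",
            (owner, scope, note),
        )


def recall(conn: sqlite3.Connection, agent: str, query: str) -> list[str]:
    """Return matching notes the agent may see: its own plus all shared ones."""
    rows = conn.execute(
        "SELECT note FROM memory "
        "WHERE note LIKE ? AND (scope = 'shared' OR owner = ?) ORDER BY id",
        (f"%{query}%", agent),
    )
    return [row[0] for row in rows]
```

With this shape, a lead that writes a note with `shared=True` propagates it to every worker's `recall`, while a worker's personal notes stay private to that worker.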
We've been using it non-stop for the last week, and I already see the compounding effects. E.g. we have a morning scheduled task that makes the lead assess the previous day's work and figure out ways to improve its processes, and it got better!
To end, note that it's fully OSS and it's as easy as deploying a docker compose to a VPS, or even locally. Its core is based on an MCP that the lead and all workers share, which allows you to impersonate the lead locally to control the swarm from your coding agent too!
We implemented a super simple UI at app.agent-swarm.dev that runs in the browser only, so you can put in your API URL and key to see it in action.
P.S.: It uses the claude CLI only now, so there should be no issue with the Anthropic terms, and it's really designed to be self-hostable.
P.S.2: Obviously, 95% of the agent swarm code has been written by agent swarm via Slack :D
If you have doubts or questions about the architecture, or what we are planning to build next, happy to chat in the comments section!
I have to give it a try. Will need to check for Anthropic-compatible APIs (I know OpenRouter has one) and see how it works. Will def try it out and post some benchmarks in the repo!
Literally just started building something exactly like this yesterday with my openclaw installation (it seems lots of people are in fact). I'm loving your implementation, there's lots to learn from there. Keep up the great work!
Thanks! In fact yes, initially when we built this, OpenClaw was not there yet. And after trying it, it was clear that I could adapt swarm to have a similar approach. I believe the self-improvement part is really key.
Today I did the audio note test: it literally installed everything needed and adapted its memory to use that whenever I send follow-up audio notes from Slack :D
Yeah, I saw different approaches to solve this problem. On the native one, I think it's really limited now. The main pain point is that the teams are scoped to a single session, which feels really off to me. Also, it's local only. But we'll see what Boris will ship I guess...
It's definitely against the terms. The Claude Code OAuth token is only supposed to be used with Claude Code. I hope no one gets their Claude account banned trying to use this.
In any case, the swarm put together some research on this topic a few days ago (https://github.com/desplega-ai/agent-swarm/pull/86); maybe I'll iterate on it and see what we can get :D
This also matches my own experience.
https://github.com/desplega-ai/agent-swarm/pulls?q=is%3Apr+i...
Interesting readings in the project, such as https://github.com/desplega-ai/advanced-context-engineering-....
Has programming become too meta?
Have the swarm work on stuff you could delegate to an intern, and basically run the feedback loop with it in Slack and GitHub.
On the other hand, locally focus on the hard things you want to control.
I actually wrote about this concept here, if that’s something that might interest you: https://www.tarasyarema.com/blog/2026-02-18-introducing-sema...
But, again, from a productivity point of view, and from a correctness of approach point of view, I have learned this:
1. Avoid overengineering at all costs.
2. Doing the project is doing the project, anything else is ... not doing the project :) https://www.softwaredesign.ing/blog/doing-the-thing-is-doing...
[1] https://github.com/mohsen1/claude-code-orchestrator
[2] https://code.claude.com/docs/en/agent-teams