The real thing I think people are rediscovering with file system based search is that there’s a type of semantic search that’s not embedding based retrieval. One that looks more like how a librarian organizes files into shelves based on the domain.
We’re rediscovering forms of search we’ve known about for decades. And it turns out they’re more interpretable to agents.
We started with LLMs when everyone in search was building question answering systems. Those architectures look like the vector DB + chunking we associate with RAG.
Agents’ ability to call tools, using any retrieval backend, calls that into question.
We really shouldn’t start RAG with the assumption we need that. I’ll be speaking about the subject in a few weeks.
Right. The R in RAG stands for retrieval, and for a brief moment initially, it meant just that: any kind of tool call that retrieves information based on a query, whether that was a web search, an RDBMS query, a grep call, or asking someone to look up an address in a phone book. Nothing in RAG implies vector search and text embeddings (beyond those in the LLM itself), yet somehow people married the acronym to one very particular implementation of the idea.
Yeah there's a weird thing where people would get really focused on whether something is "actually doing RAG" when it's pulling in all sorts of outside information, just not using some kind of purpose built RAG tooling or embeddings.
Now, the pendulum on that general concept seems to be swinging the opposite direction where a lot of those people just figured out that you don't need embeddings. That's true, but I'd suggest that people don't overindex on thinking that means embeddings are not actually useful or valuable. Embeddings can be downright magical in what you can build with them, they're just one more tool at your disposal.
You can mix and match these things, too! Indexing your documents into semantically nested folders for agents to peruse? Try chunking and/or summarizing each one, and putting the vectors in sidecar files, or even YAML frontmatter. Disks are fast these days; you can rip through a lot of files indexed like that before you come close to needing something more sophisticated.
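As a rough illustration of the sidecar idea, here is a minimal sketch that writes one vector file next to each document and then brute-force scans them. The `toy_embed` function is a hypothetical stand-in (hashed bag-of-words), not a real embedding model, and the `.vec.json` naming is an assumption made for this example.

```python
import json
import math
import tempfile
from pathlib import Path

def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[sum(word.encode()) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def write_sidecars(root: Path) -> None:
    """Store each document's vector next to it as <name>.vec.json."""
    for doc in root.rglob("*.md"):
        doc.with_suffix(".vec.json").write_text(json.dumps(toy_embed(doc.read_text())))

def search(root: Path, query: str, k: int = 3) -> list[tuple[float, str]]:
    """Brute-force scan: read every sidecar and rank by cosine similarity."""
    q = toy_embed(query)
    scored = []
    for sidecar in root.rglob("*.vec.json"):
        vec = json.loads(sidecar.read_text())
        score = sum(a * b for a, b in zip(q, vec))  # both vectors are unit length
        doc = sidecar.with_name(sidecar.name.removesuffix(".vec.json") + ".md")
        scored.append((score, doc.relative_to(root).as_posix()))
    return sorted(scored, reverse=True)[:k]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "fantasy").mkdir()
    (root / "fantasy" / "dragons.md").write_text("a dragon guards the mountain hoard")
    (root / "scifi").mkdir()
    (root / "scifi" / "ships.md").write_text("the starship crew explores a nebula")
    write_sidecars(root)
    results = search(root, "dragon hoard", k=1)
    print(results)
```

Swapping `toy_embed` for a real model would leave the file layout and the scan unchanged, which is most of the appeal of this setup.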
You seem like someone who knows what they're doing, and I understand the theoretical underpinnings of LLMs (math background), but I have little kids that were born in 2016 and so the entire AI thing has left me in the dust. Never any time to even experiment.
I am active in fandoms and want to create a search where someone can ask "what was that fanfic where XYZ happened?" and get an answer back in the form of links to fanfiction that are responsive.
This is a RAG system, right? I understand I need an actual model (that's something like ollama), the thing that trawls the fanfiction archive and inserts whatever it's supposed to insert into one of these vector DBs, and I need a front-facing thing I write, that takes a user query, sends it to ollama, which can then search the vector DB and return results.
Or something like that.
Is it a RAG system that solves my use case? And if so, what software might I go about using to provide this service to me and my friends? I'm assuming it's pretty low in resource usage since it's just text indexing (maybe indexing new stuff once a week).
The goal is self-hosting. I don't wanna be making monthly payments indefinitely for some silly little thing I'm doing for me and my friends.
I am just a stay-at-home dad these days and don't have anyone to ask. I've been totally out of the tech game for a few years now. I hope that you could respond (or someone else could), and maybe it will help other people.
There's just so many moving parts these days that I can't even hope to keep up. (It's been rather annoying to be totally unable to ride this tech wave the way I've done in the past; watching it all blow by me is disheartening).
In the definition of RAG discussed here, the workflow looks something like this (simplified for brevity): when you send your query to the server, it first normalises the words, then converts them to vectors, or embeddings, using an embedding model (there are also plain stochastic mechanisms to do this, but today most people mean a purpose-built LLM). An embedding is essentially an array of numeric coordinates in a high-dimensional space, something like [1, 2.522, …, -0.119].
The server can now use that to search a database of arbitrary documents with pre-generated embeddings of their own. Those are usually generated when the documents are inserted into the database, following the same process as for your search query above, so every record in the database has its own, discrete set of embeddings to be queried during searches.
The important part here is that you now don’t have to compare strings anymore (like looking for occurrences of the word "fanfiction" in the title and content), but instead you can perform arbitrary mathematical operations to compare query embeddings to stored embeddings: 1 is closer to 3 than to 7, and in the same way, fanfiction is closer to romance than it is to biography. Now, if you rank documents by that proximity and take the top 10 or so, you end up with the documents most similar to your query, and thus the most relevant.
That is the R in RAG. The A, as in Augmentation, happens when, before forwarding the search query to an LLM, you also add all results that came back from your vector database with a prefix like "the following records may be relevant to answer the user's request". That brings us to G, as in Generation: the LLM now responds to the question aided by a limited set of relevant entries from a database, which should allow it to yield very relevant responses.
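The retrieve, augment, generate flow described above can be sketched roughly as follows. The `embed` function here is a toy word-count vector standing in for a real embedding model (so it only captures word overlap, not meaning), and the document list is made up for illustration. The final LLM call is left out; the sketch stops at the augmented prompt.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Indexing: embeddings are computed once, when documents are inserted.
documents = [
    "slow burn romance between two rival mages",
    "biography of a famous starship engineer",
    "rival mages forced to share a tower",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query: str, k: int = 2) -> list[str]:
    """R: rank stored documents by proximity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query: str) -> str:
    """A: prepend the retrieved records to the user's request."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return (
        "The following records may be relevant to answer the user's request:\n"
        f"{context}\n\nUser request: {query}"
    )

# G: this prompt is what would be sent to the LLM.
prompt = build_prompt("fic about rival mages")
print(prompt)
```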
I think the example you give is a little backwards — a RAG system searches for relevant content before sending anything to the LLM, and includes any content retrieved this way in the generative prompt. User query -> search -> results -> user query + search results passed in same context to LLM.
I don't think this was a simple assumption. LLMs used to be much dumber!
GPT-3 era LLMs were not good at grep, they were not that good at recovering from errors, and they were not good at making follow-up queries over multiple turns of search. Multiple breakthroughs in code generation, tool use, and reasoning had to happen on the model side to make vector-based RAG look like unnecessary complexity.
There's a similar effort with PageIndex [1], which basically creates a table-of-contents-like tree. Then an LLM traverses the tree to figure out which chunks are relevant for the context in the prompt.
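To make the tree-traversal idea concrete, here is a minimal sketch. It is not PageIndex's API: the `Node` type and the `pick_child` heuristic are hypothetical, and `pick_child` uses word overlap as a stand-in for the LLM call that would choose a branch at each level.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)

def pick_child(question: str, children: list[Node]) -> Node:
    """Stand-in for the LLM: pick the child whose title shares the most words with the question."""
    q = set(question.lower().split())
    return max(children, key=lambda c: len(q & set(c.title.lower().split())))

def traverse(root: Node, question: str) -> list[str]:
    """Walk from the root to a leaf, recording the titles along the way."""
    path, node = [root.title], root
    while node.children:
        node = pick_child(question, node.children)
        path.append(node.title)
    return path

toc = Node("Docs", children=[
    Node("Getting started", children=[Node("Install"), Node("Quickstart")]),
    Node("API reference", children=[Node("Authentication tokens"), Node("Rate limits")]),
])

path = traverse(toc, "how do rate limits work in the api")
print(" > ".join(path))
```

Each level costs one decision, so the number of model calls grows with the depth of the tree, not with the number of chunks.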
This kind of circles back to ontological NLP, which used knowledge representation as a primitive for language processing. There is _a ton_ of work in that direction.
I think it's cool that LLMs can effectively do this kind of categorization on the fly at relatively large scale. When you give the LLM tools beyond just "search", it really is effectively cheating.
This is one of the most confusing claims I've seen in a long time. Grep and others over files would be the equivalent of an old fashioned keyword search where most RAG uses vector search. But everything else they claim about a file system just suggests that they don't know anything about databases.
I'm not familiar with how most out-of-the-box RAG systems categorize data, but with a database you can index content literally any way you want. You could do it like a filesystem with hierarchy, you could do it with tags, or any other design you can dream up.
The search can be keyword, like grep, or vector, like RAG, or use the ranking algorithms that traditional text search uses (tf-idf, BM25), or a combination of them. You don't have to use just the top X ranked documents; you could, just like grep, evaluate all results past whatever matching threshold you have.
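As an example of the "traditional ranking plus threshold" option, here is a small self-contained BM25 scorer that keeps every document above a score cutoff instead of a fixed top X. The documents, the whitespace tokenizer, and the 0.5 threshold are arbitrary choices for this sketch; `k1` and `b` are the usual BM25 tuning parameters set to common default values.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "grep searches files for lines matching a pattern",
    "vector search ranks documents by embedding distance",
    "bm25 ranks documents by term frequency and rarity",
]
scores = bm25_scores("ranks documents", docs)

# Keep everything above a threshold rather than a fixed top-k.
matches = [d for d, s in zip(docs, scores) if s > 0.5]
print(matches)
```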
Search is an extremely rich field with a ton of very good established ways of doing things. Going back to grep and a file system is going back to ... I don't know, the 60s level of search tech?
Sorry, this still makes no sense. LLMs don't care about files. The way most coding systems work is that they simply provide the whole file to the LLM rather than a subset of it. That's just a choice in how you implemented your RAG search system and database. In this case the "record" is big, a file. No doubt that works for code, but it's nonsensical outside that.
E.g. for wikipedia the logical unit would likely be an article. For a book, maybe it's a chapter, or maybe it's a paragraph. You need to design the system around your content and feed the LLM an appropriate logically related set of data.
This feels like massive overengineering just to bypass naive chunking. Emulating a POSIX shell in TS on top of ChromaDB to do hierarchical search is going to destroy your TTFT. Every ls and grep the agent decides to run is a separate inference cycle. You're just trading RAG context loss for severe multi-step latency.
I am really enjoying this renaissance in CLI world applications. There's so much possible.
I'm working on a related challenge which is mounting a virtual filesystem with FUSE that mirrors my Mac's actual filesystem (over a subtree like ~/source), so I can constrain the agents within that filesystem, and block destructive changes outside their repo.
I have it so every repo has its own long-lived agent. They do get excited and start changing other repos, which messes up memory.
I didn't want to create a system user per repo because that's obnoxious, so I created a single claude system user, and I am using the virtual file system to manage permissions. My gmail repo's agent can for instance change the gmail repo and the google_auth repo, but it can't change the rag repo.
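The permission rule described here boils down to a path check. Below is a minimal sketch of that check on its own, without any FUSE machinery; the `WRITE_ALLOWLIST` table, the `may_write` function, and the `/home/claude/source` root are all hypothetical names chosen for the example, not the commenter's implementation.

```python
from pathlib import Path

# Hypothetical policy: which repos each agent may modify.
WRITE_ALLOWLIST = {
    "gmail": ["gmail", "google_auth"],
    "rag": ["rag"],
}

def may_write(agent: str, target: str, source_root: str = "/home/claude/source") -> bool:
    """Return True if `target` (relative to the source root) is inside a repo the agent may change."""
    root = Path(source_root).resolve()
    # Resolve first so that ".." segments cannot escape an allowed repo.
    resolved = (root / target).resolve()
    return any(
        resolved.is_relative_to(root / repo)
        for repo in WRITE_ALLOWLIST.get(agent, [])
    )

print(may_write("gmail", "google_auth/token.py"))
print(may_write("gmail", "rag/index.py"))
print(may_write("gmail", "gmail/../rag/index.py"))
```

A virtual filesystem layer would call something like this on every write, rename, or unlink before passing the operation through to the real disk.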
Relative to making docs accessible to AI via filesystem tools, I've been looking around to see what kinds of patterns SDK authors are using to get AI coding agents to use the freshest documentation, and Vercel is doing something interesting with their AI SDK that I haven't seen elsewhere (although maybe I just haven't looked hard enough).
The "ai" npm package includes a root-level docs folder containing .mdx versions of the docs from their site, specific to the version of the package. Their intended AI-assisted developer experience is that people discover and install their ai-sdk skill (via their npx skills tool, which supports discovery and install of skills from most any provider, not just Vercel). The SKILL.md instructs the agent to explicitly ignore all knowledge that may have been trained into its model, and to first use grep to look for docs in node_modules/ai/docs/ before searching the website.
This mirrors something we ran into building an AI pipeline for audio content. The problem with traditional RAG is that chunking destroys the structure that actually matters — you end up retrieving fragments that are semantically similar but contextually useless.
The filesystem metaphor works because it preserves hierarchy. Documents have sections, sections have relationships, and those relationships carry meaning that gets lost when you flatten everything into embeddings.
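One way to keep that hierarchy when splitting documents is to chunk on headings and carry the heading path along with each piece. A minimal sketch, assuming Markdown input with `#` style headings (it does not handle `#` lines inside code fences):

```python
def split_by_headings(markdown: str) -> list[dict]:
    """Split a Markdown document into sections, keeping each section's heading path."""
    sections, path, body = [], [], []

    def flush() -> None:
        # Emit the text collected so far under the current heading path.
        text = "\n".join(body).strip()
        if text:
            sections.append({"path": " > ".join(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(line.lstrip("#").strip())
        else:
            body.append(line)
    flush()
    return sections

doc = """# Guide
Intro text.
## Install
Run the installer.
## Configure
Edit the config file.
"""
sections = split_by_headings(doc)
for s in sections:
    print(s["path"], "|", s["text"])
```

Each chunk then knows where it sits in the document, whether it ends up in a folder tree, a database column, or alongside an embedding.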
Curious how this handles versioning though. Docs change constantly and stale context fed to an LLM is arguably worse than no context at all.
Seems like it would be simpler to give the agent tools to issue ChromaDB (or SQL) queries directly, rather than giving the LLM unix-like tools that are converted into queries under the hood using a complicated proprietary setup.
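For the SQL variant of this suggestion, a direct query tool can be quite small. The sketch below uses SQLite from the standard library; the `docs` table, the `make_sql_tool` helper, and the SELECT-only guardrail are assumptions for illustration, not anything from the article.

```python
import sqlite3

def make_sql_tool(conn: sqlite3.Connection, max_rows: int = 20):
    """Return a function an agent could call as a tool with a SQL string."""
    def run_query(sql: str) -> list[tuple]:
        # Guardrail: only allow a single SELECT statement.
        stripped = sql.strip().rstrip(";")
        if not stripped.lower().startswith("select") or ";" in stripped:
            raise ValueError("only single SELECT statements are allowed")
        return conn.execute(stripped).fetchmany(max_rows)
    return run_query

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (path TEXT, body TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", [
    ("guides/auth.md", "How to create an API key"),
    ("guides/deploy.md", "Deploying with the CLI"),
])

query_docs = make_sql_tool(conn)
rows = query_docs("SELECT path FROM docs WHERE body LIKE '%API key%'")
print(rows)
```

The open question from the thread still applies: whether agents use a tool like this as reliably as they use `grep` and `ls`.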
I don't know - we are discussing techniques - like having information in files, or in a semantic database, or in a relational database - as if there was one way that could dominate all information access. But finding the right information is not one task - if the needed information is a summary of expenses from a period of time then the best source of it will be a relational database; if it is who is the head of the HR department in a particular company - then it could probably be easily found on the company intranet pages (which are a kind of graph database).
It does not really matter much if the searcher is a human or LLM - there are some differences in the speed, the one time useful context length and the fact that LLMs are amnesiac - but these are just parameters, the task for humans is immensely complicated and there is no one architecture and there will not be one for LLMs.
I also vibed a brainstorming note with my knowledge base system. The initial prompt:
"""when I read "We replaced RAG with a virtual filesystem for our AI documentation assistant (mintlify.com)" title on HackerNews - the discussion is about RAG, filesystems, databases, graphs - but maybe there is something more fundamental in how we structure the systems so that the LLM can find the information needed to answer a question. Maybe there is nothing new - people had elaborate systems in libraries even before computers - but maybe there is something. Semantic search sounds useful - but knowing which page to return might be nearly as difficult as answering the question itself - and what about questions that require synthesis from many pages? Then we have distillation - an table of content is a kind of distillation targeting the task of search.
"""
Then I added a few more comments and the llm linked the note with the other pages in my kb. I am documenting that - because there were many voices against posting LLM generated content and that a prompt will be enough. IMHO the prompt is not enough - because the thought was also grounded in the whole theory I gathered in the KB. And that is also kind of on topic here. Anyway - here is the vibed note: https://zby.github.io/commonplace/notes/charting-the-knowled...
I am not familiar with the tech stack they use, but from an outsider's point of view, I was sort of expecting some kind of FUSE solution. Could someone explain why they went with a fake shell? There has to be a reason.
100% agree a FUSE mount would be the way to go given more time and resources.
Putting Chroma behind a FUSE adapter was my initial thought when I was implementing this but it was way too slow.
I think we would also need to optimize grep even if we had a FUSE mount.
This was easier in our case because we didn’t need 100% POSIX compatibility for our read-only docs use case: the agent used only a subset of bash commands to traverse the docs anyway. This also avoids any extra infra overhead or maintenance of EC2 nodes/sandboxes that the agent would have to use.
Yah my Claude Code agents run a ton of Python and bash scripts. You're probably missing out on a lot of tool use cases without full tool use through POSIX compatibility.
This is definitely the way. There are good use cases for real sandboxes (if your agent is executing arbitrary code, you'd better have it do so in an air-gapped environment).
But the idea of spinning up a whole VM to use unix IO primitives is way overkill. Makes way more sense to let the agent spit out unix-like tool calls and then use whatever your prod stack uses to do IO.
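A toy version of that idea: the agent emits `ls`, `cat`, or `grep` command lines, and a dispatcher answers them from whatever store you already have. Here the store is a plain dict standing in for a production database; the `STORE` contents, the command set, and the `run` function are hypothetical, and this is not how ChromaFs or just-bash are implemented.

```python
import shlex

# Stand-in for the production store: path -> document text.
STORE = {
    "docs/auth/keys.md": "Create an API key in the dashboard.",
    "docs/auth/oauth.md": "OAuth requires a redirect URL.",
    "docs/deploy/cli.md": "Deploy with the CLI.",
}

def ls(prefix: str) -> list[str]:
    """List the immediate children under a directory-like prefix."""
    prefix = prefix.rstrip("/") + "/"
    return sorted({p[len(prefix):].split("/")[0] for p in STORE if p.startswith(prefix)})

def cat(path: str) -> list[str]:
    """Return the lines of one document."""
    return STORE[path].splitlines()

def grep(pattern: str, prefix: str) -> list[str]:
    """Return path:line for every line under the prefix containing the pattern."""
    return [
        f"{path}:{line}"
        for path, text in sorted(STORE.items())
        if path.startswith(prefix)
        for line in text.splitlines()
        if pattern in line
    ]

COMMANDS = {"ls": ls, "cat": cat, "grep": grep}

def run(command_line: str) -> list[str]:
    """Parse a unix-like command line and dispatch it to the backing store."""
    name, *args = shlex.split(command_line)
    if name not in COMMANDS:
        raise ValueError(f"unsupported command: {name}")
    return COMMANDS[name](*args)

print(run("ls docs"))
print(run("grep API docs/auth"))
```

Pipes, flags, and globbing are where the real work lies, and this sketch skips them entirely.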
I agree that would have been the way to go given more time and resources. However, setting up a FUSE mount would have taken significantly longer and required additional infrastructure.
I think this is a great approach for a startup like Mintlify. I do have skepticism around how practical this would be in some of the “messier” organisations where RAG stands to add the most value. From personal experience, getting RAG to work well in places where the structure of the organisation and the information contained therein is far from hierarchical or partition-able is a very hard task.
The use case is well defined here, let’s not jump the gun. Text search, like with code, is a relatively simple problem compared to intrinsic semantic content in a book, for example. I think the moral here is that RAG is not a silver bullet; the Claude Code team came to the same conclusion.
Modern OCR tooling is quite good. If the knowledge you are adding into your search database is able to be OCR'd then I think the approach we took here is able to be generalized.
I don't get it - everybody in this thread is talking about the death of vector DBs and files being all you need. The article clearly states that this is a layer on top of their existing Chroma db.
Which sounds like a great idea, except that it uses NFS instead of FUSE (note that macFUSE now has a FSKit backend so FUSE seems like the best solution for both Mac and Linux).
This puts a lot of LLM in front of the information discovery. That would require far more sophisticated prompting and guardrails. I'd be curious to see how people architect an LLM->document approach with tool calling, rather than RAG->reranker->LLM. I'm also curious what the response times are like since it's more variable.
Hmmm, the post is an attempt to explain that Mintlify migrated from embedding-retrieval->reranker->LLM to an agent loop with access to call POSIX tools as it desires. Perhaps we didn't provide enough detail?
That matches what I'm curious about. Where an LLM is doing the bulk of information discovery and tool calling directly. Most simpler RAGs have an LLM on the frontend mostly just doing simpler query clean up, subqueries and taxonomy, then again later to rerank and parse the data. So I'd imagine the prompting and guardrails part is much more complicated in an agent loop approach, since it's more powerful and open ended.
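For a sense of what the agent-loop shape looks like compared with a fixed pipeline, here is a bare-bones sketch with a step budget as the only guardrail. The `scripted_policy` function is a hard-coded stand-in for the model, and the `grep` tool returns a canned string; both are assumptions for illustration.

```python
from typing import Callable

Action = tuple[str, str]  # (tool name, argument) or ("answer", final text)

def agent_loop(question: str,
               policy: Callable[[str, list[str]], Action],
               tools: dict[str, Callable[[str], str]],
               max_steps: int = 5) -> str:
    """Let the policy choose tools until it answers or the step budget runs out."""
    observations: list[str] = []
    for _ in range(max_steps):
        tool, arg = policy(question, observations)  # one model call per step
        if tool == "answer":
            return arg
        if tool not in tools:
            observations.append(f"error: unknown tool {tool}")
            continue
        observations.append(tools[tool](arg))
    return "step budget exhausted"

def scripted_policy(question: str, observations: list[str]) -> Action:
    """Stand-in for the model: search once, then answer from what came back."""
    if not observations:
        return ("grep", "rate limit")
    return ("answer", f"Based on the docs: {observations[-1]}")

tools = {"grep": lambda term: f"limits.md: the {term} is 100 requests per minute"}
answer = agent_loop("What is the rate limit?", scripted_policy, tools)
print(answer)
```

Because every iteration is a model call, response time varies with how many steps the model decides to take, which is the variability the comments above ask about.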
> even a minimal setup (1 vCPU, 2 GiB RAM, 5-minute session lifetime) would put us north of $70,000 a year based on Daytona's per-second sandbox pricing ($0.0504/h per vCPU, $0.0162/h per GiB RAM)
$70k?
how about if we round off one zero? Give us $7000.
Hm. I think a dedicated 16-core box with 64 GB of RAM can be had for under $1000/year.
Since it's dedicated, there are no limits on session lifetime, and it'd run 16 of those sessions no problem, so the real price should be around ~$70/year for that load.
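The numbers in this exchange can be checked with a little arithmetic. The sketch below assumes the $70,000 figure is simply rate times session time at the quoted prices, with no other fees; the session volume is not stated in this thread, so it is derived here from that assumption.

```python
VCPU_RATE = 0.0504   # $/hour per vCPU (quoted)
RAM_RATE = 0.0162    # $/hour per GiB of RAM (quoted)
SESSION_MINUTES = 5  # quoted session lifetime

# Cost of one minimal sandbox (1 vCPU, 2 GiB) per hour and per session.
hourly = 1 * VCPU_RATE + 2 * RAM_RATE
per_session = hourly * SESSION_MINUTES / 60

# Sessions per year that would add up to $70,000 at that price.
implied_sessions = 70_000 / per_session

# Average number of sessions running at the same moment under that volume.
minutes_per_year = 365 * 24 * 60
avg_concurrent = implied_sessions * SESSION_MINUTES / minutes_per_year

# Per-slot cost of the dedicated box from the comment above ($1000/year, 16 sessions).
dedicated_per_slot = 1000 / 16

print(f"per session:        ${per_session:.4f}")
print(f"implied sessions:   {implied_sessions:,.0f} per year")
print(f"avg concurrency:    {avg_concurrent:.1f}")
print(f"dedicated per slot: ${dedicated_per_slot:.2f} per year")
```

Under these assumptions the quoted figure implies roughly ten million sessions a year, so whether one 16-session box is enough depends on how that load is actually spread out.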
Let's say I want a free, simple solution (local or free-tier LLM) to search information mostly from my emails and a little bit from text, doc and PDF files. Is there any tool I should try to get Ollama or Gemini to answer from my own knowledge base?
Agentic RAG: A More Powerful Approach
We can overcome these limitations by implementing an Agentic RAG system - essentially an agent equipped with retrieval capabilities. This approach transforms RAG from a rigid pipeline into an interactive, reasoning-driven process.
The innovation of the blogpost is in the retrieval step.
If grep and ls do the trick, then sure you don't need RAG/embeddings. But you also don't need an LLM: a full text search in a database will be a lot more performant, faster and use less resources.
> "The agent doesn't need a real filesystem; it just needs the illusion of one. Our documentation was already indexed, chunked, and stored in a Chroma database to power our search, so we built ChromaFs: a virtual filesystem that intercepts UNIX commands and translates them into queries against that same database. Session creation dropped from ~46 seconds to ~100 milliseconds, and since ChromaFs reuses infrastructure we already pay for, the marginal per-conversation compute cost is zero."
Not to be "that guy" [0], but (especially for users who aren't already in ChromaDB) -- how would this be different for us from using a RAM disk?
> "ChromaFs is built on just-bash ... a TypeScript reimplementation of bash that supports grep, cat, ls, find, and cd. just-bash exposes a pluggable IFileSystem interface, so it handles all the parsing, piping, and flag logic while ChromaFs translates every underlying filesystem call into a Chroma query."
It sounds like the expected use-case is that agents would interact with the data via standard CLI tools (grep, cat, ls, find, etc), and there is nothing Chroma-specific in the final implementation (? Do I have that right?).
The author compares the speeds against the Chroma implementation vs. a physical HDD, but I wonder how the benchmark would compare against a Ramdisk with the same information / queries?
I'm very willing to believe that Chroma would still be faster / better for X/Y/Z reason, but I would be interested in seeing it compared, since for many people who already have their data in a hierarchical tree view, I bet there could be some massive speedups by mounting the memory directories in RAM instead of HDD.
We would also be super interested to see that comparison. I agree that there isn't a specific reason why Chroma would be required to build something like this.
https://softwaredoug.com/blog/2026/01/08/semantic-search-wit...
https://maven.com/p/7105dc/rag-is-the-what-agentic-search-is...
I hope this helps :-)
1: https://github.com/VectifyAI/PageIndex
Empirically, agents (especially the coding CLIs) seem to be doing so much better with files, even if the tooling around them is less than ideal.
With other custom tools they instantly lose 50 IQ points, if they even bother using the tools in the first place.
Edit: I'm publishing it here. It's still under development. https://github.com/sunir/bashguard
https://github.com/vercel/ai/blob/main/skills/use-ai-sdk/SKI...
But in the end, I would expect that you could add a skill / instructions on how to use chromadb directly.
To be honest, I have no idea what chromadb is or how it works. But building an overlay FS seems like quite a lot of work.
github copilot uses rag
[0] https://news.ycombinator.com/item?id=14550060
AgentFS https://agentfs.ai/ https://github.com/tursodatabase/agentfs
We were bitten by our own nomenclature.
Just a small variation in chosen acronym ... may have wrought a different outcome.
Different ways to find context are welcome, we have a long way to go!
That number still seems to be very high.
This could be useful.
https://huggingface.co/docs/smolagents/en/examples/rag
[0] - https://news.ycombinator.com/item?id=9224
RIP RAG: lasted one year at a skillset that recruiters would list on job descriptions, collectively shut down by industry professionals