Ask HN: How are you doing RAG locally?

I am curious how people are doing RAG locally with minimal dependencies for internal code or complex documents?

Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?

99 points | by tmaly 17 hours ago

27 comments

esperent 5 minutes ago
[delayed]
CuriouslyC 2 hours ago
Don't use a vector database for code, embeddings are slow and bad for code. Code likes bm25+trigram, that gets better results while keeping search responses snappy.
[-]
- jankovicsandras 35 minutes ago
  You can do hybrid search in Postgres.
  Shameless plug: https://github.com/jankovicsandras/plpgsql_bm25 BM25 search implemented in PL/pgSQL ( Unlicense / Public domain )
  The repo includes also plpgsql_bm25rrf.sql : PL/pgSQL function for hybrid search ( plpgsql_bm25 + pgvector ) with Reciprocal Rank Fusion; and Jupyter notebook examples.
- postalcoder 1 hour ago
  I agree. Someone here posted a drop-in for grep that added the ability to do hybrid text/vector search but the constant need to re-index files was annoying and a drag. Moreover, vector search can add a ton of noise if the model isn't meant for code search and if you're not using a re-ranker.
  For all intents and purposes, running gpt-oss 20B in a while loop with access to ripgrep works pretty dang well. gpt-oss is a tool calling god compared to everything else i've tried, and fast.
- rao-v 47 minutes ago
  Anybody know of a good service / docker that will do BM25 + vector lookup without spinning up half a dozen microservices?
- ehsanu1 1 hour ago
  I've gotten great results applying it to file paths + signatures. Even better if you also fuse those results with BM25.
- itake 1 hour ago
  With AI needing more access to documentation, WDYT about using RAG for documentation retrieval?
- lee1012 2 hours ago
  static embedding models im finding quite fast lee101/gobed https://github.com/lee101/gobed is 1ms on gpu :) would need to be trained for code though the bigger code llm embeddings can be high quality too so its just yea about where is ideal on the pareto fronteir really , often yea though your right it tends to be bm25 or rg even for code but yea more complex solutions are kind of possible too if its really important the search is high quality
tebeka 47 minutes ago
https://duckdb.org/2024/05/03/vector-similarity-search-vss
baalimago 21 minutes ago
I thought that context building via tooling was shown to be more effective than rag in practically every way?
Question being: WHY would I be doing RAG locally?
beret4breakfast 14 minutes ago
For the purposes of learning, I’ve built a chatbot using ollama, streamlit, chromadb and docling. Mostly playing around with embedding and chunking on a document library.
autogn0me 47 minutes ago
https://github.com/ggozad/haiku.rag/ - the embedded lancedb is convenient and has benchmarks; uses docling. qwen3-embedding:4b, 2560 w/ gpt-oss:20b.
Strift 6 minutes ago
I just use a web server and a search engine.
TL;DR: - chunk files, index chunks - vector/hybrid search over the index - node app to handle requests (was the quickest to implement, LLMs understand OpenAPI well)
I wrote about it here: https://laurentcazanove.com/blog/obsidian-rag-api
cbcoutinho 1 hour ago
The Nextcloud MCP Server [0] supports Qdrant as a vectordb to store embeddings and provide semantic search across your personal documents. This enables any LLM & MCP client (e.g. claude code) into a RAG system that you can use to chat with your files.
For local deployments, Qdrant supports storing embeddings in memory as well as in a local directory (similar to sqlite) - for larger deployments Qdrant supports running as a standalone service/sidecar and can be made available over the network.
[0] https://github.com/cbcoutinho/nextcloud-mcp-server
dvorka 1 hour ago
Any suggestion what to use as embeddings model runtime and semantic search in C++?
rahimnathwani 15 hours ago
If your data aren't too large, you can use faiss-cpu and pickle
https://pypi.org/project/faiss-cpu/
[-]
- notyourwork 3 hours ago
  For the uneducated, how large is too large? Curious.
  [-]
  - itake 1 hour ago
    FAISS runs in RAM. If your dataset can't fit into ram, FAISS is not the right tool.
- hahahahhaah 43 minutes ago
  Shoud it be:
  If the total size of your data isn't loo large...?
  Data being a plural gets me.
  You might have small datums but a lot of kilobytes!
init0 2 hours ago
I built a lib for myself https://pypi.org/project/piragi/
[-]
- stingraycharles 1 hour ago
  That looks great! Is there a way to store / cache the embeddings?
lormayna 54 minutes ago
I have done some experiments with nomic embedding through Ollama and ChromaDB.
Works well, but I didn't tested on larger scale
ehsanu1 1 hour ago
Embedded usearch vector database. https://github.com/unum-cloud/USearch
eajr 15 hours ago
Local LibreChat which bundles a vector db for docs.
lee1012 2 hours ago
lee101/gobed https://github.com/lee101/gobed static embedding models so they are embedded in milliseconds and on gpu search with a cagra style on gpu index with a few things for speed like int8 quantization on the embeddings and fused embedding and search in the same kernel as the embedding really is just a trained map of embeddings per token/averaging
motakuk 15 hours ago
LightRAG, Archestra as a UI with LightRAG mcp
jeanloolz 2 hours ago
Sqlite-vec
whattheheckheck 15 hours ago
Anythingllm is promising
nineteen999 11 hours ago
A little BM25 can get you quite a way with an LLM.
jeffchuber 3 hours ago
try out chroma or better yet as opus to!
electroglyph 3 hours ago
simple lil setup with qdrant
pdyc 2 hours ago
sqlite's bm25
ramesh31 11 hours ago
SQLite with FTS5
undergrowth 2 hours ago
Undergrowth.io
[-]
- kristopolous 33 minutes ago
  A new account, named after the thinking you're linking just looks like spam.
  Also I've got no idea what this product does, this is just a generic page of topical ai buzzwords
  Don't tell me what it is, /show me why/ you built it. Then go back and keep that reasoning in, show me why I should care
undergrowth 2 hours ago
undergrowth.io
lee101 2 hours ago
[dead]