chatbot

(github.com)

84 points | by mpaepper 783 days ago

11 comments

grogenaut 783 days ago
Reading the readme makes me think it's only searching the top 4 most likely docs via the embeddings, not the wiki at any time? or am I misunderstanding how this works? With embeddings being close to just term vector matching via dot(?) product?
So basically get all the sub-prhases/sounds -> vector -> check vector db for closest matching documents -> send to gpt for summarization and answering the quetsion.
If that's ture wouldn't that have severe limitations with scattered information? I guess it would help you get answers and walk the data better than the "I don't even know the term" problem with google?
[-]
- mpaepper 783 days ago
  Yep, that's the way it's currently implemented in langchain.
  The 4 is a hyperparameter you can change, though, so you could set it to 10 as well.
  The way it works is that it first looks up the N most relevant documents (N being 4 in the default case) in the FAISS store relevant to the question, so it uses distance of embedding vectors for this lookup.
  Then it uses GPT3 to get summaries of the 4 entries related to the question and finally all the summaries together with the question will lead to the answer.
  In doing so, you can trace the source where the answer came from and can also point to that URL in the end.
  When you make N larger it just gets more expensive in terms of your API costs.
  [-]
  - kacperlukawski 783 days ago
    Looks interesting! Have you considered a proper vector database like Qdrant (https://qdrant.tech)? FAISS runs on a single machine, but if you want to scale things up, then a real database makes it a lot easier. And with a free 1GB cluster on Qdrant Cloud (https://cloud.qdrant.io), you can store quite a lot of vectors. Qdrant is also already integrated with Langchain.
    [-]
    - Guillaume86 783 days ago
      Probably not very helpful at the scale most people would run this. Even brute forcing the search on CPU gives results in a few ms on small datasets.
      [-]
      - kordlessagain 782 days ago
        Using something like Weaviate, which can be started in Docker with a one-liner, will give the ability to move away or toward dense vectors by concept. While doing dot product with manual code is fairly easy, using Weaviate to do the lifting (for embeddings as well) makes things super simple.
        https://github.com/FeatureBaseDB/slothbot/blob/slothbot-work...
        [-]
        grogenaut 782 days ago
        that means you need docker running and the dependencies explode if you take this approach. I really like the tight dependency tree.
    - mpaepper 783 days ago
      Thanks for the suggestion, but for my fun small experiment, FAISS was more than enough.
- wmf 783 days ago
  They probably took this approach because it's the only thing you can do with the OpenAI APIs (for now). Training your own corpus will be the way to go once it's possible.
layoric 783 days ago
Nice to have tools like this to wrap up features, definitely makes these types of solutions more accessible, thanks!
It would be nice to know from your experience if there is a kind of rule of thumb for calculating cost of fine tuning and running a solution like this against a docs site?
[-]
- mpaepper 783 days ago
  I don't have larger scale experience on this at the moment, but I can tell you what I observed during my trials (also see my related blog entry for some more info: https://www.paepper.com/blog/posts/build-q-and-a-bot-of-your...)
  It cost around 0.05$ to create the embeddings for my ~50 blog entries.
  Asking a question in the way I've described it also costs around 0.05$ via the API.
  [-]
  - danielbln 782 days ago
    0.05$ for a question seems expensive, are you using davinci3 or gpt3.5-turbo?
petesergeant 783 days ago
I tried to do something tangentially similar recently, telling ChatGPT that I'd ask it a question, but rather than a response, I wanted search terms for Wikipedia and Wikidata that I could give it that would have the answer in. The thinking is I'd then be able to provide those to it, and get it to synthesize that data, providing answers that had decent citations in them.
Perhaps it was the example I chose "flight time from New York to London" but I couldn't really get it to provide sensible search terms for the information it wanted or needed
[-]
- rahimnathwani 783 days ago
  Check out langchain. It implements the ReAct approach, which is similar to what you describe, but without a human needing to be the go-between.
limcheekin 780 days ago
Thanks for sharing the code. What happen when the existing content get updated and new contents created, would it need to create embeddings for all contents again? The current approach is not good as create embeddings cost money? Please see https://github.com/mpaepper/content-chatbot/blob/main/create.... Would it be possible progressively update the vector store?
Please advise. Thank you.
mdotk 783 days ago
Make this a Wordpress plugin and I'd pay for it
[-]
- itake 783 days ago
  Have you tested the recall of embedding search? I'm not sure about the latest results, but 1-2 years ago, it had 50-70% recall :-/
- mdmglr 783 days ago
  Along with paying the openAI API fees? Perhaps a cache layer for common qa.
wedn3sday 783 days ago
I would absolutely love to take our internal Wiki and use this against it.
[-]
- mpaepper 783 days ago
  Go for it, it's quite easy to do, you only need to adapt the `https://github.com/mpaepper/content-chatbot/blob/main/create...` file a bit to match the way your data is represented and then you are good to go.
jonaraphael 782 days ago
Awesome work! Thanks for sharing.
For anyone interested in an audio version that talks to you, that you can get on your site today, my brother put this together a few weeks ago! https://siteguide.ai/
nico 783 days ago
Awesome!
Are you planning on adding agent/tools support?
It would be cool to use this with internal data, then allow clients to chat with a bot fine-tunes on their data, but that can also run queries, or get reports for specific dates, or charts, all via tools.
[-]
- mpaepper 783 days ago
  Yes, this will be an interesting next experiment - adding agents with additional tools (also for example access to internal APIs) will be quite powerful.
rcarmo 782 days ago
Curious to see if it can take my entire site content: https://taoofmac.com/static/graph
Might be a fun weekend experiment.
[-]
- mpaepper 782 days ago
  Woah, that's a huge site!
  Should be fine, though, as it iterates over it, it creates embeddings and then stores them in the FAISS store (https://github.com/facebookresearch/faiss) which was created to handle a large amount of embeddings.
  For the actual queries, it filters it down by the most relevant documents which are closest in the embedding space, so this should work.
  Let me know how it goes!
friendlypeg 783 days ago
How does this handle websites with complicated structure instead of your typical blogposts where ideas are divided neatly into separate paragraph?
[-]
- mpaepper 783 days ago
  Currently, it only splits documents linearly, so if you have information which is written backwards or things like that, it will likely not work so well.
mdmglr 783 days ago
See also: https://github.com/whitead/paper-qa