Your website's content -> Q&A bot / chatbot

(github.com)

84 points | by mpaepper 400 days ago

11 comments

  • grogenaut 400 days ago
    Reading the readme makes me think it's only ever searching the top 4 most likely docs via the embeddings, never the whole wiki? Or am I misunderstanding how this works? With embeddings being close to just term-vector matching via a dot(?) product?

    So basically: get all the sub-phrases/sounds -> vector -> check vector db for closest matching documents -> send to gpt for summarization and answering the question.

    If that's true, wouldn't that have severe limitations with scattered information? I guess it would still help you get answers and walk the data better than the "I don't even know the term" problem with Google?

    • mpaepper 400 days ago
      Yep, that's the way it's currently implemented in langchain.

      The 4 is a hyperparameter you can change, though, so you could set it to 10 as well.

      The way it works is that it first looks up the N most relevant documents (N being 4 in the default case) in the FAISS store, using the distance between embedding vectors for this lookup.

      Then it uses GPT-3 to summarize each of those 4 entries with respect to the question, and finally all the summaries together with the question lead to the answer.

      In doing so, you can trace which source the answer came from and also point to that URL at the end.

      When you make N larger, it just gets more expensive in terms of API costs.
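
      A rough sketch of that flow with the langchain pieces of the time (a FAISS store plus a map-reduce QA-with-sources chain); the repo's actual scripts differ in their details, so treat the names and parameters below as illustrative:

      ```python
      # Sketch only: retrieve the top-N chunks from FAISS, then let GPT-3
      # summarize and combine them (map_reduce) into an answer with source URLs.
      from langchain.embeddings import OpenAIEmbeddings
      from langchain.vectorstores import FAISS
      from langchain.llms import OpenAI
      from langchain.chains.qa_with_sources import load_qa_with_sources_chain

      texts = ["text of page one ...", "text of page two ..."]   # scraped chunks
      metadatas = [{"source": "https://example.com/page-1"},      # one URL per chunk
                   {"source": "https://example.com/page-2"}]

      store = FAISS.from_texts(texts, OpenAIEmbeddings(), metadatas=metadatas)

      question = "How do I configure the widget?"
      docs = store.similarity_search(question, k=4)               # N = 4 by default

      chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="map_reduce")
      result = chain({"input_documents": docs, "question": question})
      print(result["output_text"])                                # answer plus cited sources
      ```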

      • kacperlukawski 400 days ago
        Looks interesting! Have you considered a proper vector database like Qdrant (https://qdrant.tech)? FAISS runs on a single machine, but if you want to scale things up, then a real database makes it a lot easier. And with a free 1GB cluster on Qdrant Cloud (https://cloud.qdrant.io), you can store quite a lot of vectors. Qdrant is also already integrated with Langchain.
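
        For reference, swapping FAISS for Qdrant via the langchain wrapper looks roughly like this (a sketch assuming a locally running Qdrant instance; the exact keyword arguments vary between langchain/qdrant-client versions):

        ```python
        # Sketch: same embeddings, stored in Qdrant instead of an in-process FAISS index.
        from langchain.embeddings import OpenAIEmbeddings
        from langchain.vectorstores import Qdrant

        texts = ["text of page one ...", "text of page two ..."]
        metadatas = [{"source": "https://example.com/page-1"},
                     {"source": "https://example.com/page-2"}]

        store = Qdrant.from_texts(texts, OpenAIEmbeddings(), metadatas=metadatas,
                                  host="localhost")   # or point it at a Qdrant Cloud cluster

        docs = store.similarity_search("How do I configure the widget?", k=4)
        ```
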
        • Guillaume86 400 days ago
          Probably not very helpful at the scale most people would run this. Even brute forcing the search on CPU gives results in a few ms on small datasets.
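
          To illustrate: brute-force nearest-neighbour search over a few thousand embeddings is just a matrix product (sketch with made-up data):

          ```python
          # Sketch: exact cosine-similarity search with plain numpy, no vector DB at all.
          import numpy as np

          rng = np.random.default_rng(0)
          docs = rng.standard_normal((5000, 1536)).astype(np.float32)  # ~5k docs, ada-002-sized vectors
          docs /= np.linalg.norm(docs, axis=1, keepdims=True)          # normalize once

          query = rng.standard_normal(1536).astype(np.float32)
          query /= np.linalg.norm(query)

          scores = docs @ query                    # cosine similarity via dot products
          top4 = np.argsort(scores)[-4:][::-1]     # the 4 closest documents
          ```
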
          • kordlessagain 400 days ago
            Using something like Weaviate, which can be started in Docker with a one-liner, will give you the ability to move away from or toward dense vectors by concept. While doing the dot product with manual code is fairly easy, using Weaviate to do the lifting (for the embeddings as well) makes things super simple.

            https://github.com/FeatureBaseDB/slothbot/blob/slothbot-work...

            • grogenaut 400 days ago
              That means you need Docker running, and the dependencies explode if you take this approach. I really like the tight dependency tree.
        • mpaepper 400 days ago
          Thanks for the suggestion, but for my fun small experiment, FAISS was more than enough.
    • wmf 400 days ago
      They probably took this approach because it's the only thing you can do with the OpenAI APIs (for now). Training on your own corpus will be the way to go once it's possible.
  • layoric 400 days ago
    Nice to have tools like this to wrap up features, definitely makes these types of solutions more accessible, thanks!

    It would be nice to know from your experience if there is a rule of thumb for calculating the cost of fine-tuning and running a solution like this against a docs site?

    • mpaepper 400 days ago
      I don't have larger scale experience on this at the moment, but I can tell you what I observed during my trials (also see my related blog entry for some more info: https://www.paepper.com/blog/posts/build-q-and-a-bot-of-your...)

      It cost around $0.05 to create the embeddings for my ~50 blog entries.

      Asking a question in the way I've described it also costs around $0.05 via the API.

      • danielbln 400 days ago
        $0.05 per question seems expensive; are you using davinci-003 or gpt-3.5-turbo?
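
        For context, a rough back-of-envelope with the list prices at the time (the chunk sizes below are assumptions, not measurements from the repo):

        ```python
        # Sketch: why ~$0.05/question points at davinci rather than gpt-3.5-turbo.
        # Prices per 1K tokens in early 2023 (they change): davinci-003 $0.02,
        # gpt-3.5-turbo $0.002, ada-002 embeddings $0.0004.
        chunks, tokens_per_chunk, overhead = 4, 500, 500        # assumed sizes

        total_tokens = chunks * tokens_per_chunk + overhead     # map over chunks + final reduce
        print("davinci-003:   $", total_tokens / 1000 * 0.02)   # ~$0.05 per question
        print("gpt-3.5-turbo: $", total_tokens / 1000 * 0.002)  # ~$0.005 per question
        print("embed a query: $", 50 / 1000 * 0.0004)           # negligible
        ```
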
  • petesergeant 400 days ago
    I tried to do something tangentially similar recently: telling ChatGPT that I'd ask it a question, but that rather than an answer, I wanted search terms for Wikipedia and Wikidata that would contain the answer. The thinking was that I'd then be able to provide those results back to it and get it to synthesize that data, producing answers with decent citations in them.

    Perhaps it was the example I chose ("flight time from New York to London"), but I couldn't really get it to provide sensible search terms for the information it wanted or needed.

    • rahimnathwani 400 days ago
      Check out langchain. It implements the ReAct approach, which is similar to what you describe, but without a human needing to be the go-between.
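
      Roughly the langchain ReAct docstore pattern of that era, applied to the Wikipedia case (a sketch; it needs the wikipedia package installed and the agent identifier may differ by version):

      ```python
      # Sketch: let the model emit Search/Lookup actions against Wikipedia itself
      # instead of having a human ferry search terms back and forth.
      from langchain.llms import OpenAI
      from langchain.docstore import Wikipedia
      from langchain.agents import initialize_agent, Tool
      from langchain.agents.react.base import DocstoreExplorer

      docstore = DocstoreExplorer(Wikipedia())
      tools = [
          Tool(name="Search", func=docstore.search,
               description="search Wikipedia for a page"),
          Tool(name="Lookup", func=docstore.lookup,
               description="look up a term in the most recently found page"),
      ]

      agent = initialize_agent(tools, OpenAI(temperature=0),
                               agent="react-docstore", verbose=True)
      agent.run("How long is a typical flight from New York to London?")
      ```
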
  • limcheekin 397 days ago
    Thanks for sharing the code. What happens when existing content gets updated and new content is created? Would it need to create the embeddings for all the content again? That wouldn't be great, since creating embeddings costs money. Please see https://github.com/mpaepper/content-chatbot/blob/main/create.... Would it be possible to progressively update the vector store?

    Please advise. Thank you.
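
    One way this could work if the store is kept around instead of rebuilt (a sketch assuming the pickled FAISS store written by the create script and langchain's add_texts; the file name is an assumption, and replacing vectors for edited pages would still need extra handling):

    ```python
    # Sketch: load the existing store, embed only the new pages, append, re-save.
    import pickle

    with open("faiss_store.pkl", "rb") as f:          # assumed output of the create script
        store = pickle.load(f)

    new_texts = ["text of the new or updated page ..."]
    new_metadatas = [{"source": "https://example.com/new-page"}]

    store.add_texts(new_texts, metadatas=new_metadatas)  # only these get embedded (and billed)

    with open("faiss_store.pkl", "wb") as f:
        pickle.dump(store, f)
    ```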

  • mdotk 400 days ago
    Make this a Wordpress plugin and I'd pay for it
    • itake 400 days ago
      Have you tested the recall of embedding search? I'm not sure about the latest results, but 1-2 years ago, it had 50-70% recall :-/
    • mdmglr 400 days ago
      Along with paying the OpenAI API fees? Perhaps a cache layer for common Q&A.
  • wedn3sday 400 days ago
    I would absolutely love to take our internal Wiki and use this against it.
  • jonaraphael 400 days ago
    Awesome work! Thanks for sharing.

    For anyone interested in an audio version that talks to you, that you can get on your site today, my brother put this together a few weeks ago! https://siteguide.ai/

  • nico 400 days ago
    Awesome!

    Are you planning on adding agent/tools support?

    It would be cool to use this with internal data, then allow clients to chat with a bot fine-tuned on their data, but one that can also run queries, get reports for specific dates, or build charts, all via tools.

    • mpaepper 400 days ago
      Yes, this will be an interesting next experiment: adding agents with additional tools (for example, access to internal APIs) will be quite powerful.
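
      A sketch of what that could look like with a langchain zero-shot agent and a hypothetical internal reporting function (get_sales_report and its behaviour are made up for illustration):

      ```python
      # Sketch: expose an internal API as a Tool so the bot can fetch live data,
      # not just answer from the embedded docs.
      from langchain.llms import OpenAI
      from langchain.agents import initialize_agent, Tool

      def get_sales_report(date: str) -> str:
          """Hypothetical wrapper around an internal reporting API."""
          return f"(the report for {date} would be fetched here)"

      tools = [
          Tool(name="SalesReport", func=get_sales_report,
               description="get the sales report for a date in YYYY-MM-DD format"),
      ]

      agent = initialize_agent(tools, OpenAI(temperature=0),
                               agent="zero-shot-react-description", verbose=True)
      agent.run("Get me the sales report for 2023-03-01.")
      ```
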
  • rcarmo 400 days ago
    Curious to see if it can take my entire site content: https://taoofmac.com/static/graph

    Might be a fun weekend experiment.

    • mpaepper 400 days ago
      Woah, that's a huge site!

      Should be fine, though: it iterates over the site, creates embeddings and then stores them in the FAISS store (https://github.com/facebookresearch/faiss), which was built to handle large numbers of embeddings.

      For the actual queries, it filters down to the most relevant documents, i.e. the ones closest in embedding space, so this should work.

      Let me know how it goes!

  • friendlypeg 400 days ago
    How does this handle websites with a complicated structure, as opposed to your typical blog posts where ideas are divided neatly into separate paragraphs?
    • mpaepper 400 days ago
      Currently, it only splits documents linearly, so if you have information which is written backwards or things like that, it will likely not work so well.
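
      For reference, the kind of linear splitting meant here (a sketch with langchain's CharacterTextSplitter; the repo's actual separator and chunk size may differ):

      ```python
      # Sketch: pages are cut into fixed-size chunks in reading order, so facts that
      # only make sense together but sit far apart end up in different chunks.
      from langchain.text_splitter import CharacterTextSplitter

      page_text = "the full scraped text of one page ..."   # placeholder content
      splitter = CharacterTextSplitter(separator="\n", chunk_size=1500, chunk_overlap=200)
      chunks = splitter.split_text(page_text)               # linear, order-preserving chunks
      ```
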
  • mdmglr 400 days ago