Ask HN: Is anyone building a question answering system using the HN corpus?

23 points | by rahimnathwani 11 days ago

5 comments

  • flemhans 11 days ago
    I'd love to do this offline, so I could feed it all my mail. Am I right that it's still going to be a while before we can do that? Or perhaps with a less good model than GPT-3?
    • rahimnathwani 11 days ago
      The things I've seen all use hosted language models. For example https://github.com/jerryjliu/gpt_index depends on LangChain, which wraps APIs from hosted LLMs: https://langchain.readthedocs.io/en/latest/reference/modules...

      AFAIK there's no GPT-3-like LLM that's easy to run at home, because the number of parameters is so large. Your gaming PC's GPU won't have enough RAM to hold the model. For example, gpt-neox-20b needs about 40GB of RAM: https://huggingface.co/EleutherAI/gpt-neox-20b/discussions/1...
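
      The 40GB figure is consistent with simple arithmetic on the weights alone. A rough sketch, assuming a round 20 billion parameters stored at 2 bytes each (16-bit precision); the function name is made up for illustration, and activations or other runtime overhead are not counted:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed just to hold the weights, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumption: 20e9 parameters, taken from the "20b" in the model name.
print(weight_memory_gb(20e9))     # 16-bit weights: 40.0
print(weight_memory_gb(20e9, 4))  # 32-bit weights: 80.0
```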

      • flemhans 11 days ago
        I wouldn't mind throwing hardware at the problem, but I haven't yet found a "full" guide on how to set it up. It always ends up being something to run in Google Colab or similar.
        • ttt3ts 10 days ago
          Hmm. I have full-size BLOOM running on a server in my basement. It can be run naively with about 400GB of RAM. Using used hardware you can get that for about $1200. Still, with CPU inference, you're looking at about 5 mins per response.

          With optimization, I have it down to 140GB of RAM. Trying to get it under 120GB without losing too much accuracy so it can be run on standard desktop consumer hardware (whose limits are usually 128GB).

          Given the lack of resources I have found, I figured general interest was low. Maybe I will open source it and do a write-up.

          • flemhans 9 days ago
            I think that would be an incredibly interesting write-up. There are many applications where a 5-min response time would be more than adequate. It could slowly churn through the inbox, while I'm not looking. Or parse customer emails and suggest replies for a rep to potentially use.
  • olivierduval 11 days ago
    Mmmm... and what about copyright? I mean: may I dump all of HN and then consider it a book to be sold for my own profit? And if I can't do it... what is the difference between this idea and using HN to train an LLM? And what if I don't want my comments to be part of this LLM? Or what about the "trash" accounts that don't want to be identified?

    Don't get me wrong: the idea could be nice but... isn't it time to think twice about all this before applying the latest technological fad?

    • deadly_syn 9 days ago
      I'd argue there isn't an equivalence between an LLM and dumping the data straight into a book; there's a significant layer of abstraction there, from my understanding. It is entirely legal for you to read HN and paraphrase what you learned in a book, which I'd argue is a much fairer comparison.
    • kleer001 11 days ago
      It would easily fall under the auspices of 'fair use'. IANAL.
  • leobg 11 days ago
    Been thinking about this many times. I regularly check what HN thinks about a specific book, what services HN recommends to perform a particular task, etc.

    To the sibling comment that asked about doing this locally: there’s really no need for an LLM, much less for GPT-3. All you need is, well, attention. Sentence-transformer embeddings. Perhaps even just fastText.
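
    A minimal sketch of what embedding-based lookup could look like, runnable with only the standard library. The embed function here is a toy bag-of-words stand-in, not a sentence-transformer or fastText model; in practice it would be swapped for real embeddings. All names and the sample comments are invented for illustration:

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, comments: list[str], k: int = 3) -> list[str]:
    """Return the k comments most similar to the query."""
    q = embed(query)
    ranked = sorted(comments, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical sample data, not real HN comments.
comments = [
    "I use Fastmail for email hosting and it has been reliable.",
    "This book on compilers is the best introduction I have read.",
    "Postgres handles our analytics workload fine.",
]
print(search("which email hosting service is reliable", comments, k=1))
```

    With real embeddings the search function would stay the same shape: embed once per comment, store the vectors, and rank by similarity at query time.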

    • rahimnathwani 11 days ago
      AIUI sentence-transformer embeddings work for sentences or short paragraphs. But many comments only make sense in the context of parent/GP comments. This is especially true when a comment answers an earlier question.

      I'm not sure how we'd pack enough context into a single 'sentence', to get a useful embedding for this purpose.

      (I might be wrong of course.)

      • leobg 10 days ago
        You keep track of the topic by prepending a summary of the parent. The hierarchical nature of HN should make this somewhat easy.
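
        One way this could be sketched, assuming each comment records its parent's id. The Comment class, the summarize placeholder (it just keeps the first few words) and the " > " separator are all hypothetical choices, and this version walks every ancestor rather than only the direct parent:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comment:
    id: int
    text: str
    parent: Optional[int] = None  # None for the top-level item


def summarize(text: str, max_words: int = 12) -> str:
    """Placeholder summary: keep the first few words of the text."""
    words = text.split()
    suffix = "..." if len(words) > max_words else ""
    return " ".join(words[:max_words]) + suffix


def text_to_embed(comment_id: int, by_id: dict[int, Comment]) -> str:
    """Prepend ancestor summaries (root first) to the comment's own text."""
    comment = by_id[comment_id]
    context = []
    parent_id = comment.parent
    while parent_id is not None:
        parent = by_id[parent_id]
        context.append(summarize(parent.text))
        parent_id = parent.parent
    context.reverse()  # collected leaf-to-root, so flip to root-first
    return " > ".join(context + [comment.text])


# Hypothetical three-level thread.
thread = {
    1: Comment(1, "Ask HN: best laptop for Linux?"),
    2: Comment(2, "Which distro do you plan to run?", parent=1),
    3: Comment(3, "Debian, mostly.", parent=2),
}
print(text_to_embed(3, thread))
```

        The resulting string, rather than the bare reply, is what would be passed to the embedding model, so a short answer such as "Debian, mostly." carries the question it responds to.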
  • dyeje 9 days ago
    I would assume OpenAI products are already trained on it, amongst many other sites.
  • gschoeni 10 days ago
    Has somebody crawled and made a corpus out of Hacker News? Is it maintained?