Ask HN: Is anyone building a question answering system using the HN corpus?

Today, if someone wants to know what the HN community knows/thinks about a topic, they can either:

A) Search past HN comments on hn.algolia.com, or

B) Post a new 'Ask HN'.

LLMs could provide a new way to find answers within a corpus. Approaches to this have been described elsewhere, e.g.

- https://github.com/openai/openai-cookbook/blob/main/examples...

- https://news.ycombinator.com/item?id=34477543

I keep expecting someone (maybe minimaxir or simonw?) to post a 'Show HN: Get your question answered by the collective wisdom of HN', but no one has so far (unless I missed the submission?).

Is someone already working on this?

23 points | by rahimnathwani 889 days ago

5 comments

flemhans 889 days ago
I'd love to do this offline, so I could feed it all my mail. Am I right that it's still going to be a while before we can do that? Or perhaps with a less good model than GPT-3?
- rahimnathwani 889 days ago
  The things I've seen all use hosted language models. For example https://github.com/jerryjliu/gpt_index depends on LangChain, which wraps APIs from hosted LLMs: https://langchain.readthedocs.io/en/latest/reference/modules...
  AFAIK there's no GPT-3-like LLM that's easy to run at home, because the number of parameters is so large. Your gaming PC's GPU won't have enough RAM to hold the model. For example, gpt-neox-20b needs about 40GB of RAM: https://huggingface.co/EleutherAI/gpt-neox-20b/discussions/1...
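For a rough sense of where a figure like 40GB comes from, here is a back-of-the-envelope sketch. It assumes gpt-neox-20b has about 20 billion parameters (taken from the model name) and counts only the weights at a given precision; activations, caches and framework overhead are ignored, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to hold the weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: ~20 billion parameters, inferred from the name "gpt-neox-20b".
NEOX_PARAMS = 20e9

# Bytes per parameter for a few common storage precisions.
for label, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{label}: ~{weight_memory_gb(NEOX_PARAMS, nbytes):.0f} GB")
```

Under these assumptions, 16-bit weights alone come to about 40GB, which matches the number quoted above.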
  - flemhans 889 days ago
    I wouldn't mind throwing hardware at the problem, but I haven't yet found a "full" guide on how to set it up; it always ends up being something to run in Google Colab or similar.
    - ttt3ts 889 days ago
      Hmm. I have full-size BLOOM running on a server in my basement. It can be run naively with about 400GB of RAM. Using used hardware you can get that for about $1200. Still, with CPU inference, you're looking at about 5 mins per response.
      With optimization, I have it down to 140GB of RAM. Trying to get it under 120GB without losing too much accuracy so it can be run on standard desktop consumer hardware (whose limits are usually 128GB).
      Given the lack of resources I have found, I figured the general interest was low? Maybe I will open source it and do a write-up.
      - flemhans 888 days ago
        I think that would be an incredibly interesting write-up. There are many applications where a 5-min response time would be more than adequate. It could slowly churn through the inbox, while I'm not looking. Or parse customer emails and suggest replies for a rep to potentially use.
olivierduval 889 days ago
Mmmm... and what about copyright? I mean: may I dump all of HN and then consider it a book to be sold for my own profit? And if I can't do it... what is the difference between this idea and using HN to train an LLM? And what if I don't want my comments to be part of this LLM? Or what about the "trash" accounts that don't want to be identified?
Don't get me wrong: the idea could be nice but... isn't it time to think twice about all this before applying the latest technological fad?
- deadly_syn 888 days ago
  I'd argue there isn't an equivalence between an LLM and dumping the data straight into a book; there's a significant layer of abstraction there, from my understanding. It is entirely legal for you to read HN and paraphrase what you learned in a book, which I'd argue is a much fairer comparison.
- kleer001 889 days ago
  It would easily fall under the auspices of 'fair use'. IANAL.
leobg 889 days ago
Been thinking about this many times. I regularly check what HN thinks about a specific book, what services HN recommends to perform a particular task, etc.
To the sibling comment that asked about doing this locally: there’s really no need for an LLM, much less for GPT-3. All you need is, well, attention. Sentence-transformer embeddings. Perhaps even just fastText.
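To make the embedding idea concrete, here is a minimal retrieval sketch: embed every comment, embed the query, rank by cosine similarity. To keep it stdlib-only, `embed` is a toy bag-of-words stand-in (an assumption for illustration, not a real model); in practice it would be replaced by sentence-transformer or fastText vectors. The sample comments are made up.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, comments: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Return the k comments most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in comments]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up sample comments, purely for illustration.
comments = [
    "I run BLOOM on a server in my basement with CPU inference.",
    "Fair use probably covers training on public comments.",
    "The HN API makes it easy to download every comment.",
]
for score, text in search("how do I download HN comments", comments, k=2):
    print(f"{score:.2f}  {text}")
```

With real embeddings the ranking would capture meaning rather than word overlap, but the surrounding search loop would look much the same.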
- rahimnathwani 889 days ago
  AIUI sentence-transformer embeddings work for sentences or short paragraphs. But many comments only make sense in the context of parent/GP comments. This is especially true when a comment answers an earlier question.
  I'm not sure how we'd pack enough context into a single 'sentence', to get a useful embedding for this purpose.
  (I might be wrong of course.)
  - leobg 888 days ago
    You keep track of the topic by prepending a summary of the parent. The hierarchical nature of HN should make this somewhat easy.
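One way to read this suggestion, sketched below: before embedding a comment, walk up its parent chain and prepend a short summary of each ancestor. The `Comment` class and the `summarize` helper are hypothetical; the summarizer here just truncates to the first few words, where a real system would presumably use something smarter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comment:
    id: int
    text: str
    parent: Optional[int] = None  # None for a top-level comment


def summarize(text: str, max_words: int = 12) -> str:
    """Hypothetical placeholder summarizer: keep only the first few words."""
    words = text.split()
    return " ".join(words[:max_words]) + ("..." if len(words) > max_words else "")


def text_with_context(comment_id: int, by_id: dict[int, Comment]) -> str:
    """Prepend summaries of all ancestors (oldest first) to the comment text."""
    comment = by_id[comment_id]
    ancestors = []
    parent_id = comment.parent
    while parent_id is not None:
        parent = by_id[parent_id]
        ancestors.append(summarize(parent.text))
        parent_id = parent.parent
    ancestors.reverse()  # root of the thread first, direct parent last
    return "\n".join([f"[context] {s}" for s in ancestors] + [comment.text])


# Made-up three-level thread, purely for illustration.
thread = {
    1: Comment(1, "Has somebody crawled Hacker News?"),
    2: Comment(2, "HN has an API.", parent=1),
    3: Comment(3, "Is it rate limited?", parent=2),
}
print(text_with_context(3, thread))
```

The string returned by `text_with_context` is what would be embedded, so a reply like "Is it rate limited?" carries the question it is answering along with it.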
dyeje 888 days ago
I would assume OpenAI products are already trained on it, amongst many other sites.
gschoeni 889 days ago
Has somebody crawled and made a corpus out of Hacker News? Is it maintained?
- krapp 889 days ago
  HN has an API[0]; with a bit of effort you can make one yourself.
  [0]https://github.com/HackerNews/API
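As a rough sketch of what "a bit of effort" might look like: the API serves each story or comment as JSON at `/v0/item/<id>.json`, with child comment IDs listed under `kids`. The helper names below are made up, and the `fetch` parameter exists only so the traversal can be tried without network access.

```python
import json
from typing import Callable, Iterator
from urllib.request import urlopen

API_ROOT = "https://hacker-news.firebaseio.com/v0"


def fetch_item(item_id: int) -> dict:
    """Fetch one story or comment as a dict from the HN API."""
    with urlopen(f"{API_ROOT}/item/{item_id}.json", timeout=10) as resp:
        # The API returns JSON null for unknown IDs; normalize that to {}.
        return json.load(resp) or {}


def walk_thread(
    item_id: int, fetch: Callable[[int], dict] = fetch_item
) -> Iterator[dict]:
    """Yield an item and its descendants depth-first, skipping deleted items."""
    item = fetch(item_id)
    if not item:
        return
    if not item.get("deleted"):
        yield item
    for kid_id in item.get("kids", []):
        yield from walk_thread(kid_id, fetch)
```

Calling `walk_thread(34477543)` would pull down the thread linked in the original post, one HTTP request per item, so crawling all of HN this way would be slow and the bulk dumps are likely a better starting point.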
- rahimnathwani 889 days ago
  Apparently there are two ways to access it on GCP:
  https://github.com/ashish01/hn-data-dumps