My main source right now is twitter with arxiv links retweeted most by people I follow.
My favourite ones are:
https://twitter.com/arxiv_cs_cl
https://twitter.com/papers_daily
Where do you mainly find good papers?
It is powered by transformer models from sbert.net, which are used to assign articles to 20 clusters generated daily; I see the top 15 from each cluster. This does a reasonable job of handling a diverse feed that includes CS abstracts, trade publication articles, sports news, etc. I'm most satisfied on days when the system gets a lot of articles (volume peaks on Thursday) and less so on weekends, when I sometimes backfill high-scoring articles from the previous week.
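Roughly, the daily step is this kind of thing (a sketch with sentence-transformers and scikit-learn; the model name, the scoring input and the function shape are placeholders, not my exact setup):

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
    import numpy as np

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder SBERT model

    def daily_digest(articles, scores, n_clusters=20, top_k=15):
        # embed the day's articles, cluster the batch, keep the best of each cluster
        embeddings = encoder.encode(articles, normalize_embeddings=True)
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
        scores = np.asarray(scores)
        digest = []
        for c in range(n_clusters):
            idx = np.where(labels == c)[0]
            best = idx[np.argsort(-scores[idx])][:top_k]  # top_k highest-scoring in cluster
            digest.extend(articles[i] for i in best)
        return digest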
I tried using fine-tuned BERT-like models for classification and got them to equal the performance of the embedding-based system, but only after a huge amount of work and a much longer training time. My problem is pretty noisy, and there is some limit to how high I can get the AUC.
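For reference, that fine-tuned baseline was along these lines (a sketch with Hugging Face transformers/datasets; the model name and hyperparameters here are placeholders, not what I actually ran):

    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    def finetune_classifier(texts, labels, model_name="distilbert-base-uncased"):
        # texts: article strings, labels: 0/1 read-worthiness judgements
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
        ds = Dataset.from_dict({"text": texts, "label": labels})
        ds = ds.map(lambda b: tok(b["text"], truncation=True,
                                  padding="max_length", max_length=256),
                    batched=True)
        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="bert-feed", num_train_epochs=3,
                                   per_device_train_batch_size=16),
            train_dataset=ds,
        )
        trainer.train()
        return tok, model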
Interested in your embedding-based system - is that an embedding layer plus a neural net?
Sounds very cool overall :)
The embedding system uses a probability-calibrated SVM. My average AUC is 0.77; I hear TikTok gets into the low 80s, and they are using collaborative filtering. I got 0.72 with a bag-of-words and logistic regression model.
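In code, the two models are roughly this (a sketch with scikit-learn and sentence-transformers; the estimator choices and model name are placeholders rather than my exact configuration):

    from sentence_transformers import SentenceTransformer
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.svm import LinearSVC
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder SBERT model

    def fit_embedding_svm(texts, labels):
        # SBERT embeddings -> linear SVM with probability calibration (the 0.77 AUC system)
        clf = CalibratedClassifierCV(LinearSVC(), cv=5)
        clf.fit(encoder.encode(texts), labels)
        return clf

    def fit_bow_baseline(texts, labels):
        # bag-of-words + logistic regression (the 0.72 AUC baseline)
        clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(texts, labels)
        return clf

    def score_articles(clf, texts):
        # probability an article is worth reading, used to rank the feed
        return clf.predict_proba(encoder.encode(texts))[:, 1]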
From a product standpoint it has the disadvantage that it takes about 1,000 judgements to really get good. Right now I am training over the last 40 days of data, because it doesn't really get better with more than that, which is good news because the compute and storage stay nicely bounded.
On an unrelated note, I realized recently that the 'bag' in bag-of-words is another name for the multiset data structure... which makes sense when you think about the text as being a _set_ of tokens which can appear _multiple_ times.
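Concretely, the bag/multiset is just a token-count mapping, e.g. with collections.Counter:

    from collections import Counter

    bag = Counter("the cat sat on the mat".split())
    print(bag)         # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
    print(bag["the"])  # 2 - order is gone, multiplicity is kept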
I was talking to somebody about its potential as an open source project and came to the conclusion that it's a research project right now, though my research projects are more solid than average. I'm not afraid to demo it because I run it every day and it spins like a top.
If you want to chat about it look up my profile and send me an email.
The last time I was interested in a topic (tree segmentation) I used elicit.org * and found it really useful for finding new papers.
* From the FAQ:
If you ask a question, Elicit will show relevant papers and summaries of key information about those papers in an easy-to-use table.