Love the idea. I have thought about this before, and the main issue is training myself to click an extension to check. If an icon flashed when there was an HN submission (ideally only if I didn't come directly from HN, since I already know), it would be way more useful.
That said, we all know the big issue with that is privacy. I don't want an extension sending every url I visit to any service (directly to the API or through some third party). I've mulled over this issue before, and I'm not sure how much space it would take to store a list of urls that have been submitted to HN (maybe keep 1 month of submissions, plus everything that got over 100 upvotes or something) and check against that local list.
Then, and only then, when you get a match you can call out to the Algolia API to get the HN url (or store that as well depending on size).
I have no idea, off the top of my head, what the storage requirements would look like, but I don't think they would be huge. The other issue (I want to look at the source to see how this extension handles it) is the stupid social/ads tracking params that get added to URLs. Maybe there is a good list of these that you can strip (from both the current URL and the HN submission) so you can check whether it's the same base URL.
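A minimal sketch of that normalization step, assuming a hand-picked list of tracking parameters (the list below is illustrative, not exhaustive):

```javascript
// Normalize a URL by stripping common tracking parameters before
// comparing it against locally stored HN submissions.
const TRACKING_PARAMS = new Set([
  "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
  "fbclid", "gclid", "ref",
]);

function normalizeUrl(raw) {
  const url = new URL(raw);
  // Copy the keys first, since we delete while iterating.
  for (const key of [...url.searchParams.keys()]) {
    if (TRACKING_PARAMS.has(key)) url.searchParams.delete(key);
  }
  // Drop the fragment too; HN submissions rarely include one.
  url.hash = "";
  return url.toString();
}
```

Both the visited URL and the stored HN submission would be passed through the same function before comparison.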
Every day there are at most around a hundred articles that generate any meaningful discussion at all (see past). Let's say each article takes up <=200 bytes (URL + title + some stats, a very generous limit actually); then one year's worth of data is at most ~7 MB. That might be a bit too much, but not by much. If you gate submissions by a votes/comments threshold as a function of time, it's conceivable to store metadata for all the good discussions on HN within 20 MB or even 10 MB.
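The arithmetic behind that estimate, as a sanity check:

```javascript
// Back-of-envelope version of the estimate above: ~100 noteworthy
// stories per day, <=200 bytes of metadata each.
const bytesPerStory = 200;
const storiesPerDay = 100;
const bytesPerYear = bytesPerStory * storiesPerDay * 365;
console.log(`${(bytesPerYear / 1e6).toFixed(1)} MB per year`); // 7.3 MB per year
```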
Chrome and Firefox extensions can request the unlimitedStorage permission, btw. (Chrome has a 5MB default limit, Firefox doesn’t seem to have one.)
Or you could use a privacy-preserving lookup API, but that might be too much traffic. A Bloom filter (https://en.wikipedia.org/wiki/Bloom_filter) could be downloaded locally and is probably a better solution.
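A toy sketch of that approach; the hash functions here are seeded FNV-1a variants chosen for brevity, and a real build would size the bit array and hash count from the target false-positive rate:

```javascript
// Minimal Bloom filter: the extension downloads a filter built from all
// submitted URLs and tests membership locally; only on a (possible) hit
// does it call out to the Algolia API.
class BloomFilter {
  constructor(bits, hashes) {
    this.bits = bits;
    this.hashes = hashes;
    this.data = new Uint8Array(Math.ceil(bits / 8));
  }
  // Seeded FNV-1a, reduced to a bit index. Toy quality, not production.
  _hash(str, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < str.length; i++) {
      h ^= str.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.bits;
  }
  add(str) {
    for (let s = 0; s < this.hashes; s++) {
      const idx = this._hash(str, s);
      this.data[idx >> 3] |= 1 << (idx & 7);
    }
  }
  mightContain(str) {
    for (let s = 0; s < this.hashes; s++) {
      const idx = this._hash(str, s);
      if (!(this.data[idx >> 3] & (1 << (idx & 7)))) return false;
    }
    return true;
  }
}
```

A hit can be a false positive, so the extension would still confirm via Algolia before showing anything; a miss is definitive, so most browsing triggers no network request at all.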
> I'm not sure how much space it would take up to store a list of urls that have been submitted to HN
I was recently wondering how much space this would take up myself. After a lot of searching, I found this Reddit post, which links to an archive of Hacker News. It contains HN data from late 2006 until mid-2018 and totals just over 2 gigabytes. These dumps contain all comments, job postings, polls, poll options, and stories.
I did some super quick analysis of the 2018-05 archive (the latest provided by this source). I found that there were 237,646 total items, and only 32,473 of those are stories. That's only ~14%. Assuming the ratio of stories to non-stories has been constant for the entire dataset, that's only 280 megabytes for the entire 2006 to 2018 set.
That data can be shrunk further by removing extraneous information from each story. Mirroring the HN API, each item has the following fields: author username, id, date retrieved, score, time posted, title, type, url, whether it's dead, how many descendants it has, and which items are its kids. I didn't attempt to reduce the data to only contain links, but I imagine it would significantly reduce the size.
Once you've reduced the data down to a list of urls, I imagine it can be reduced even more by removing duplicate links.
Depending on the average size of the urls, it's not unreasonable to think that taking a hash of each of the urls would result in a smaller set of data.
On top of that, there's wonderful text compression, but I don't have the numbers on how much that would reduce the size of data.
I was curious, so I downloaded a list of id-url pairs from here [0]. It's CSV formatted and contains 1_960_207 entries (last updated 22 Feb 2019). It is 134 MiB uncompressed and 35 MiB compressed using xz, so definitely storable in a web extension.
IDs being integers smaller than 10_000_000, they can be stored in 3 bytes, and a 64-bit hash function is enough (using this approximation [1] with k=2_000_000 and N=2^64 gives p=1.08e-7), which comes to 22 MB for 2 million entries. Stats on duplicates would be needed to know the impact of bundling identical hashes together. Definitely doable!
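The approximation referenced there is the standard birthday bound, and the numbers check out:

```javascript
// P(any collision among k random 64-bit hashes) ≈ k^2 / (2N), N = 2^64.
const k = 2_000_000;
const N = 2 ** 64;
const p = (k * k) / (2 * N);
console.log(p.toExponential(2)); // ~1.08e-7

// Storage: 3-byte id + 8-byte hash per entry.
const bytes = k * (3 + 8);
console.log(`${bytes / 1e6} MB`); // 22 MB
```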
Keeping it up to date would be harder; having a server query the API and distribute the day-by-day data to every extension user is probably the best option.
[0]: https://console.cloud.google.com/marketplace/product/y-combi... [1]: https://preshing.com/20110504/hash-collision-probabilities/
Comparing hashes would help a bit on both anonymity and size concerns.
I also think, in a majority of cases, one could remove all of the query parameters from a URL and still have the same page. I'm not 100% confident about this though
That's a really good idea! Thanks for mentioning that, it's something I've used before (coded support for) but it completely slipped my mind when thinking about this.
This is awesome. I really love this idea, and love the implementation. Works great, and I think it will be really useful, or at least interesting (can't tell yet).
One piece of feedback-- I'd love a mode where the extension notifies me there are HN threads that pass a certain threshold (e.g. number of upvotes, number of comments, etc.) for every page I visit. This is less privacy preserving, but I'd be willing to make that trade-off in exchange for useful information being surfaced to me opportunistically.
Downvote if you will, all the same, I find most HN discussions to be of relatively low-value, and also it's not easy to vet whether or not someone's credentials align with what they're writing. I come across interesting links on HN all the time, and I wish I could have something to tell me "Oh, there's a LtU user with hundreds of posts discussing this with links to papers and proofs."
My personal perspective is just that way, I don't see myself coming across anything and thinking "Gee, I wonder what HN thinks about this."
I love the idea though. I wish browsers didn't suck so much. I wish Opera had won more, and maybe we'd have lots of different browsers, infinitely configurable like emacs/vim, with my whole little customized universal browsing tool. Extensions are an adequate compromise; it's just that the kind of person who thinks this stuff up could do so much MORE if browsers weren't so limited.
Just to offer my perspective, and thank you for yours - I don't feel like HN is a hive mind and a single-voiced consensus, nor that I need to check if 'HN has endorsed this project or the idea proposed by this article' as the use case for this extension. Instead, I'm instantly recognising the value in having a quick link to further information and analysis about a given page, if it exists. This is because, to me, HN comment sections can be a wealth of information in their own right, very often.
Just installed the add-on, let's see how useful it becomes. :)
I've been thinking of the same. I spend a lot of my time in forums and would like to be able to discriminate users based on certain parameters.
I envision it as some sort of extension that analyzes the users in the current discussion thread, visits their profiles, analyzes them (post history, stats, etc.), and decorates each user's handle on my current page.
Performance shouldn't be too bad using caching and prioritization.
What value adds were you envisioning specifically?
Sorry, I didn't see this the other day, I'm not really a power-user.
I guess if I could have any features I want, it'd be this:
* Topics, perhaps categorized by keywords.
Maybe, if someone uses the word "TensorFlow" I'd like to know if they have other posts that have scored well with that word in them.
* Similarly, I'd like to know what topics a user is more likely to post in. If a user only ever posts in threads that contain the word "SomeSmallStartupTheyAreClearlyShillingFor", I'd like to know that. I do find that in practice. Algorithmically, I think it's not too hard to separate these two categories and distinguish high-quality users, because of the way posts are scored here on HN.
Really though, I have been thinking about building something like this, maybe a service reading thousands of RSS feeds and keeping track of comment sections in blogs and forum threads, and just compiling these webs of influence for certain links. Like a search engine, except it'd be specialized for discovering high-quality conversations.
I made a bookmarklet for Firefox that opens the HN discussion in a new tab (if there is one) and offers to submit if there isn't. It's very quick and dirty but it does the trick.
It opens the first result from the search API, can be modified to open all of them if you want.
javascript:(()=>{const w=window.open();fetch(`https://hn.algolia.com/api/v1/search?tags=story&query=${encodeURIComponent(window.location.href)}`).then(a => a.json()).then(a=>{const c=a.hits.filter(b=>b.url===window.location.href)[0];if(c){w.location.replace(`https://news.ycombinator.com/item?id=${c.objectID}`)}else{w.confirm('Not on HN. Submit?') ? w.location.replace(`https://news.ycombinator.com/submitlink?u=${encodeURIComponent(document.location)}&t=${encodeURIComponent(document.title)}`):w.close();}})})()
Interestingly, I had to open the tab before getting the search results; it seems there is an exemption to the popup blocker for bookmarklets, but only synchronously.
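For readability, here is the bookmarklet's logic unminified into a function. `win` is the window the bookmarklet opens synchronously before fetching (because of the popup-blocker exemption just mentioned), and `hits` is the parsed Algolia response; the name `openDiscussion` is mine, not part of the original:

```javascript
function openDiscussion(win, pageUrl, pageTitle, hits) {
  const hit = hits.filter((h) => h.url === pageUrl)[0];
  if (hit) {
    // Exact URL match found: jump straight to the HN thread.
    win.location.replace(`https://news.ycombinator.com/item?id=${hit.objectID}`);
  } else if (win.confirm("Not on HN. Submit?")) {
    // No match: offer to submit the current page instead.
    win.location.replace(
      `https://news.ycombinator.com/submitlink?u=${encodeURIComponent(pageUrl)}` +
        `&t=${encodeURIComponent(pageTitle)}`
    );
  } else {
    win.close();
  }
}
```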
Also think it would be great to see an indicator of whether or not there are hits. Perhaps not a flash, as that's pretty invasive, but a hit count, kinda like the SMS or email count on icons.
The privacy thing is also making me flinch. An idea could be to disable lookups unless clicked: when I find an interesting page, product, or application, I often wonder if it's featured on HN.
Simple and great! I consume a lot of content from HN and often bookmark posted links. As everyone here knows, HN comments sometimes contribute more to the topic than the article itself (so I add them to favorites). Now both of them are linked. Thanks!
Your extension looks good! I made a similar extension with ClojureScript 4 years ago, also using the Algolia API. It's not intrusive and only looks things up when you click. Check out the code here:
https://github.com/jazzytomato/hnlookup
That's a really cool idea! I added it. I am mostly curious to learn more about sites using HN as a way to market their products or to know more about the context in which a product is discussed.
Love it! One suggestion, at the risk of promoting feature creep/visual bloat: maybe go into those threads and pull the top comments (ideally, the top comments over all discussions), and have those pop up as the first thing I see in the drop-down, instead of just links to the discussions?
Nice, I've wanted something like this for a while. HN often has substantive comments on writing from around the internet, so I often find myself checking whether something interesting I read has been submitted to HN before.
To me HN Algolia has been a one-stop shop for everything related to search in HN. I always have a browser tab that has HN Algolia opened up for any kind of research. I’d love if this extension could be extended to include HN Algolia too.
Your comment made me dig in a little more. I was wrong, it is only fetching the current tab, although it wouldn't need more permissions to see all the tabs.
These `active` and `currentWindow` parameters to query() [2] restrict the results to the current tab. If I remove those parameters and run in DevTools, I seem to get a full tab listing.
Even without the `active` and `currentWindow` parameters the extension cannot get urls and titles from other tabs, because it only has the `activeTab`[1] permission declared in the manifest. You need a more powerful permission for that.
[1]: https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/Web...
I think with the `activeTab` permission you still get an object for every tab other than the active one, but without access to the `url`, `title` and `faviconUrl` properties.
Thanks for checking it out anyway. I built this tool especially because all of the others already available were a privacy nightmare.
Although not every page offers it, for those that do, comparing the canonical link [0] should be pretty robust.
[0] https://en.wikipedia.org/wiki/Canonical_link_element
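A sketch of using the canonical link for matching. In a content script this would simply be `document.querySelector('link[rel="canonical"]')`; the regex below exists only to keep the example self-contained, and it assumes `rel=` appears before `href=` in the tag:

```javascript
// Prefer a page's declared canonical URL, falling back to the
// address-bar URL when none is declared.
function canonicalUrl(html, fallbackUrl) {
  const m = html.match(/<link[^>]*rel=["']canonical["'][^>]*href=["']([^"']+)["']/i);
  return m ? m[1] : fallbackUrl;
}
```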
Thanks for making it!!
What is LtU?
The design of What Hacker News Says is really nice though.
[0] https://chrome.google.com/webstore/detail/kiwi-conversations...
Edit: It seems the backticks mess something up in HN formatting. Code here: https://gist.github.com/llimos/ee818bcb3060adc8469f4978c654a...
Your design is much nicer tho
I'm also using [0] which displays mentions of a site on reddit.
(And while you're at it, one [1] that replaces YouTube comments with reddit comments from the subreddit threads where the video was posted.)
[0] https://chrome.google.com/webstore/detail/reddit-check/mllce...
[1] https://chrome.google.com/webstore/detail/karamel-view-reddi...
https://chrome.google.com/webstore/detail/hacker-news-lookup...
https://news.ycombinator.com/item?id=16316374
https://github.com/jdormit/looped-in
I would much prefer if it only looked up the current tab.
A more private design might fetch the top N results from algolia.com and only search through them locally.
That being said, this is cool! Thanks for sharing.
Wait, how's that possible? The extension doesn't even have permission to get urls from tabs that are not the active one...
The relevant call is in popup.js[1]; the `active` and `currentWindow` parameters to query() [2] restrict the results to the current tab.
[1]: https://github.com/pinoceniccola/what-hn-says-webext/blob/ma...
[2]: https://developer.chrome.com/extensions/tabs#method-query