Ask HN: How does archive.is bypass paywalls?

If it simply visits sites, it will face a paywall too. If it identifies itself as archive.is, then other people could identify themselves the same way.

133 points | by flerovium 718 days ago

26 comments

  • RicoElectrico 718 days ago
    Nice try, media company employee ;)

    /jk

    • PTOB 717 days ago
      My sentiments exactly.
  • fxtentacle 717 days ago
    Probably the people who operate archive.is just purchased subscriptions for the most common newspaper sites. And then they can use something like https://pptr.dev/ to automate login and article retrieval.

    I guess the business model is to inject their ads into someone else's content, so kinda like Facebook. That would also surely generate more money from the ads than the cost of subscribing to multiple newspapers.
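
    A rough sketch of what that automation could look like, using Playwright for Python as a stand-in for Puppeteer (the login URL and CSS selectors below are hypothetical placeholders, not any real site's markup):

        from playwright.sync_api import sync_playwright

        # Hypothetical example: log in with a paid subscription, then fetch an article.
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto("https://example-newspaper.com/login")        # placeholder URL
            page.fill("#username", "subscriber@example.com")        # placeholder selector
            page.fill("#password", "correct horse battery staple")  # placeholder selector
            page.click("button[type=submit]")
            page.wait_for_load_state("networkidle")
            page.goto("https://example-newspaper.com/some-article")
            html = page.content()  # fully rendered DOM, unlocked by the logged-in session
            browser.close()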

    • panopticon 717 days ago
      > Probably the people who operate archive.is just purchased subscriptions for the most common newspaper sites. And then they can use something like https://pptr.dev/ to automate login and article retrieval.

      I would expect to see login information rather than "Sign In" and "Subscribe" buttons on archived articles then. Unless they're stripping that from the archive?

      • phoenixreader 717 days ago
        Exactly. It also would not be difficult for website operators to embed hidden user info in their served pages and thereby identify the archive.is account. This approach seems risky for archive.is.
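
        For illustration, a minimal sketch of how a publisher could watermark served pages per subscriber (the salt, field name, and template step are all assumptions):

            import hashlib

            def watermark_page(html: str, subscriber_id: str) -> str:
                """Embed a per-subscriber marker so leaked or archived copies can be traced."""
                token = hashlib.sha256(f"secret-salt:{subscriber_id}".encode()).hexdigest()[:16]
                # An invisible marker; an archived copy would reveal which account fetched the page.
                return html.replace("</body>", f'<span data-t="{token}" hidden></span></body>')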
      • hda111 717 days ago
        They could just copy the div with the content over to evade detection by the website’s owner
    • tivert 717 days ago
      > Probably the people who operate archive.is just purchased subscriptions for the most common newspaper sites. And then they can use something like https://pptr.dev/ to automate login and article retrieval.

      I wouldn't be surprised. IIRC, the whole thing is privately funded by one individual, who must have a lot of money to spare.

      • Stagnant 717 days ago
        I don't think anyone knows who runs archive.is. I've tried looking into it a couple of times in the past but there is surprisingly little information to be found. It must cost thousands if not tens of thousands a month to host all that data and AFAIK they do not monetize it in any way. From what I gather it probably is some Russian person, as there were some old Stack Overflow conversations regarding the site that led to an empty GitHub account with a Russian name. Also back in 2015 the site owner blocked all Finnish IP addresses due to "an incident at the border"[1]. Finnish IPs have since been unblocked. It appears the site owner somehow thought he could end up on an EU-wide blacklist, which seemed like very conspiratorial thinking on his part.

        1: https://archive.is/Pum1p

        • killingtime74 717 days ago
          When I visit, each page has three ads: left, right, and bottom. Maybe you have an ad blocker?
        • Swiftness6022 717 days ago
          [dead]
    • Hamuko 717 days ago
      Would it be possible to check if archive.is is logged into a newspaper site by archiving one of the user management pages?
    • hoofhearted 717 days ago
      Negative. I used to assume this as well, but they somehow also bypass local paywalls, which has gotten me temporarily banned from r/Baltimore lol.

      They can somehow even bypass the Baltimore Sun's paywall, and I doubt they have subscriptions to every regional paper... or do they?

      • jrochkind1 717 days ago
        Wait, you got banned from /r/Baltimore for posting archive.is links there? That's against the rules there? I would not have known that myself! (Also a Baltimorean).
        • hoofhearted 717 days ago
          I even tried to get them into the mindset that Paul Graham created Hacker News to get more mindshare for YC. He gave the idea for Reddit to the three brilliant Ivy League founders who had applied to YC with, I think, a basic Gmail extension that copied emails or something.

          So I tried convincing them that if it’s okay here on PG’s creation, then it should be okay on his other creation.

        • hoofhearted 717 days ago
          Yeah! Hahah

          I thought knowledge was free, and the Baltimore Sun sucked anyway. They charge money and don’t even write hood stuff anymore. They laid off a bunch of people and moved printing to Delaware. My bet is the next step is that they announce they are shutting down all Locust Point operations and selling out so that Kevin Plank can build some new buildings there.

          I think I had to appeal my ban with a mod, and they mentioned how the auto bot posts all over that sharing links to websites that bypass paywalls is against their subreddit rules :(

          I even made an official proposal to r/Baltimore to reconsider and lift that rule. The general consensus on the poll was that people felt the Baltimore Sun and its writers should be getting paid for their work, and I shouldn’t be bypassing their paywalls lol.

          • jrochkind1 717 days ago
            You did it ONCE and got banned?

            I still can't find anything in the subreddit rules that clearly says this. (Not that most people read the rules first). Why don't they just add it to the rules?

            This is one of the things I dislike most about Reddit: it seems to be common to ban people for a single violation of a poorly documented or outright unstated rule.

            My main problem with reading the Sun online is it has so much adware that my browser slows to a crawl and sometimes crashes when I try to read it!

      • dev_0 717 days ago
        [dead]
    • flerovium 717 days ago
      This is a plausible explanation, but is it true? What evidence is there?

    • stevefan1999 717 days ago
      So Sci-Hub, but for newspapers
  • throwaway81523 717 days ago
    Off topic but for years I've been using a one-off proxy to strip javascript and crap from my local newspaper site (sfgate.com). It just reads the site with python urllib.request and then does some DOM cleanup with beautiful soup. I wasn't doing any site crawling or exposing the proxy to 1000s of people or anything like that. It was just improving my own reading experience.
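
    A minimal sketch of that kind of one-off proxy (not the poster's actual code; the tag cleanup is illustrative):

        import urllib.request
        from bs4 import BeautifulSoup

        def clean_article(url: str) -> str:
            """Fetch a page and strip scripts, iframes, and styles for a lighter read."""
            with urllib.request.urlopen(url) as resp:
                html = resp.read()
            soup = BeautifulSoup(html, "html.parser")
            for tag in soup(["script", "iframe", "style", "noscript"]):
                tag.decompose()
            return str(soup)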

    Just in the past day or so, sfgate.com put in some kind of anti scraping stuff, so urllib, curl, lynx etc. now all fail with 403. Maybe I'll undertake the bigger and slower hassle of trying to read the site with selenium or maybe I'll just give up on newspapers and get my news from HN ;).

    I wonder if archive.is has had its sfgate.com experience change. Just had to mention them to stay slightly on topic.

    • 1ark 717 days ago
      They are probably just checking headers such as the user agent and cookies. I would copy whatever your normal browser sends and put it in the urllib.request call. If that doesn’t work, then it is likely something more sophisticated.
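
      For example, a sketch of that in urllib (the header values are just what a desktop Firefox might send, copied by hand):

          import urllib.request

          req = urllib.request.Request(
              "https://www.sfgate.com/",
              headers={
                  # Headers copied from a normal desktop browser session.
                  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
                  "Accept": "text/html,application/xhtml+xml",
                  "Accept-Language": "en-US,en;q=0.5",
              },
          )
          with urllib.request.urlopen(req) as resp:
              print(resp.status)  # 200 if the header check is the only gate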
      • throwaway81523 717 days ago
        I will try that, but a quick look at the error page makes me think it tries to run a javascript blob.
        • ksala_ 717 days ago
          They're just checking the user agent

              $ curl -s -I 'https://www.sfgate.com/' -H 'User-Agent: curl/7.54.1' | head -1
              HTTP/2 403

              $ curl -s -I 'https://www.sfgate.com/' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0' | head -1
              HTTP/2 200

          One "trick" is that Firefox (and I assume Chrome?) allows you to copy a request as a curl command - then you can just see if that works in the terminal, and if it does you can binary search for the required headers.
        • chrisco255 717 days ago
          It probably does. But there are better modern tools like headless Chrome / Puppeteer that can fully render a page with scripts.
    • withinboredom 717 days ago
      Sounds like an ADA lawsuit waiting to happen. I'd send the editor an email explaining how they've reduced the usability of the site, especially if you're a paying customer.
  • World177 717 days ago
    I think they might just try all the user agents in the robots.txt. [1] I've included a picture showing an example. In this second image, [2] I receive the paywall with the user agent left as default. There might also just be an archival user agent that most websites accept, but I haven't looked into it very much.

    [1] https://i.imgur.com/lyeRTKo.png

    [2] https://i.imgur.com/IlBhObn.png
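
    A rough sketch of that idea (the paywall-marker string is a placeholder; real detection would be per-site):

        import re
        import urllib.request

        def fetch(url: str, ua: str) -> str:
            req = urllib.request.Request(url, headers={"User-Agent": ua})
            with urllib.request.urlopen(req) as resp:
                return resp.read().decode("utf-8", errors="replace")

        site = "https://example-newspaper.com"          # placeholder site
        article = site + "/some-article"                 # placeholder article

        # Collect every user agent the site bothers to name in its robots.txt ...
        robots = fetch(site + "/robots.txt", "Mozilla/5.0")
        agents = re.findall(r"(?im)^user-agent:\s*(\S+)", robots)

        # ... and see which of them gets served the full article.
        for ua in agents:
            if ua == "*":
                continue
            html = fetch(article, ua)
            if "subscribe to continue" not in html.lower():  # placeholder paywall marker
                print("full content served to", ua)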

    • jrochkind1 717 days ago
      That user-agent seems to be in the robots.txt as _disallowed_, but somehow it gets through the paywall? That seems counter-intuitive.
      • World177 716 days ago
        It's just blocking the root. Look up the specifications for the robots.txt for more information. One purpose is to reduce loads on parts of the website that they do not want indexed.
        • jrochkind1 712 days ago
          Definitely incorrect: the paths in the robots.txt are prefixes, so `/` means anything starting with `/`, that is, everything. Look up the specifications for robots.txt for more information! (Or, for instance, look up how you'd block a whole site in robots.txt if you wanted to!)
        • KomoD 714 days ago
          No, / means the entire site, since it matches the root and anything below it.
    • flerovium 717 days ago
      That's an interesting idea, but is it true?
      • World177 716 days ago
        Websites usually want their pages indexed by search engines, as it increases the traffic they receive. They also often try to allow archival usage. The robots.txt usually lists the user agents used by search engines, since one of its purposes is to reduce load on the website by keeping crawlers off pages that do not need to be indexed.

        It might not be what is happening, as there are other ways around paywalls, but this is a real possibility for how it could be done (at least until the websites allowing other user agents decide they want to try to stop archive.is usage, etc.)

        edit: I think it's quite probable that they have multiple methods for archiving a website. In this post there are several people saying that archive.is has previously stated it just converts the link to an AMP link and archives that. I'm doubtful that's all they do, but it could be part of it.

        Using the robots.txt file in this way might not be how the website's authors intended it to be used, and I could see that being held against them in a legal system if someone ever tried to stop them. In the past I've seen websites tell people creating bots to purposefully change their user agent to one the site defines, but using such an agent for a non-allowed purpose is what I was getting at. There are multiple ways they could be archiving a website, though, so this is not necessarily how it is being done.

  • chrisco255 717 days ago
    Just archived a website I created. It looks like it runs HTTP requests from a server to pull the HTML, JS and image files (it shows the individual requests completing before the archival process finishes). It must then snapshot the rendered output and serve those assets from its own domain. Buttons on my site don't work after the snapshot, since the scripts were stripped.
    • strunz 717 days ago
      You're missing the point of "how does it bypass firewalls"
      • hoofhearted 717 days ago
        Surprisingly, nobody has mentioned this here yet. I’m thinking the key to this is SEO, SERPs, and newspapers wanting Google to find and index their content.

        This is my best guess. I’ve really put some thought into it, and it's the most logical assumption I’ve arrived at. I used to be a master of crawlers using Selenium years ago, but that burned me out a little bit so I moved on.

        To test my hypothesis, go find any article on Google that you know is probably paywalled. You click the result Google shows you, you navigate into the site, and “bam! Paywall!”

        If it has a paywall for me, then how did Google crawl and index all the metadata for the SERP?

        I have a long-running theory that archive.is knows how to work around an SEO trick that Google uses to get the content. Websites like the Baltimore Sun don’t want humans to view their content for free, but they do want Googlebot to see it for free.
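
        One way to test that hypothesis, as a rough sketch (placeholder article URL; note that a spoofed Googlebot UA alone often won't work, since sites can verify crawler IPs, as discussed further down the thread):

            import urllib.request

            def body_length(url: str, ua: str) -> int:
                req = urllib.request.Request(url, headers={"User-Agent": ua})
                with urllib.request.urlopen(req) as resp:
                    return len(resp.read())

            url = "https://example-newspaper.com/some-article"  # placeholder
            browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0"
            googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

            # If the Googlebot response is much larger, the site is cloaking for SEO.
            print(body_length(url, browser_ua), body_length(url, googlebot_ua))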

      • chrisco255 717 days ago
        Sorry, thought it was obvious. Since it's using backend infrastructure to fetch the assets, it can crawl them as a bot in the same way that search engines do, without allowing cookies to be saved. Since scripts are often involved in the full rendering of a page, it clearly does allow for the scripts to load before snapshotting the DOM. But only the DOM and the assets and styles are preserved. Scripts are not. Most paywalls are simple scripts. If you disable JS and cookies, you'll often see the full text of an article.
        • killingtime74 717 days ago
          Some paywalls don't hide the content with JavaScript. It's just not there. They make you pay and then redirect you to another page.
        • JeremyNT 716 days ago
          I browse with scripts disabled by default, and while some paywalls rely on JS to block interaction after load, many simply send only partial content and a login dialog.

          archive.is does "something" to get the full page for sites that specifically do not send all the content to non-logged-in user agents, and it's definitely different / more complex than simply running noscript.

        • joegibbs 717 days ago
          There are a lot of paywalls that are done server-side - for instance the Herald Sun, which is one of the biggest newspapers in Australia, does it like this. Even if you check the responses there's nothing in them but links to subscribe and a brief intro to the article.
      • wackget 717 days ago
        paywalls*
      • sshine 717 days ago
        [flagged]
  • lcnPylGDnU4H9OF 718 days ago
    I think it's a browser extension which people who have access to the article use to send the article data to the archive server.
    • phoenixreader 717 days ago
      You mean the pages are crowdsourced? I don’t think so because many pages are archived only upon request. If I ask to archive a new page, archive.is provides it very quickly. This is not possible if the archive is built from crowdsourced data.
    • AlbertCory 718 days ago
      That is how RECAP works ("Pacer" spelled backwards).

      In that case, the government is fine with it.

      • wolverine876 717 days ago
        I think that's how Sci-Hub works, or at least how it worked at some point in the past.
        • JCharante 717 days ago
          I thought people would send their journal credentials to Sci-hub
    • flerovium 718 days ago
      Can you explain? Who has purchased the subscription? I'm sure there's a no-redistribution clause in the subscription agreement.
      • lcnPylGDnU4H9OF 718 days ago
        The person who installed the browser extension would be paying the subscription and ignoring said clause.
        • riku_iki 718 days ago
          curious if companies will eventually start watermarking articles and catch and sue extension users.
          • lcnPylGDnU4H9OF 718 days ago
            I suspect most content publishers would go to the source. If there are people who are already willing to pay for subscriptions and ignore the terms of those subscriptions, it's not much of a stretch that they'll ignore the fact that they got their subscription cancelled once (or twice, or however many times). The publisher would more likely see results taking legal action against the archivist.
            • dwater 717 days ago
              It didn't stop the RIAA from suing loads of people over downloading mp3s in the past 2 decades, claiming damages of thousands of dollars per song the individual downloaded.
              • riku_iki 717 days ago
                in this case (archive.is) they have a stronger case, since many people who could potentially buy a subscription read it on archive.is because an extension user violated the terms of their subscription.

                Also, the extension likely has terms of use prohibiting uploading copyrighted content, shifting liability onto users.

              • sam0x17 717 days ago
                *uploaded

                They went after seeders

                • riku_iki 717 days ago
                  downloaders also received legal letters.
        • flerovium 718 days ago
          But what is the relationship between archive.is and the user who installed the extension?
          • phneutral26 718 days ago
            The user helps free the Internet by using archive.is as an openly accessible backup platform.
          • inconceivable 718 days ago
            dude... haha it's a random person on the internet who is doing it for free.
          • lcnPylGDnU4H9OF 718 days ago
            They (archive.is) would have built the extension to send the current page content to their servers and the user would have installed it so they can archive internet pages. https://help.archive.org/help/save-pages-in-the-wayback-mach... (item 2)
            • Stagnant 717 days ago
              You are confusing archive.is with archive.org. Although archive.is does have an extension[1], it doesn't appear to capture any of the page contents; it simply sends the URL for archive.is to crawl.

              1: https://chrome.google.com/webstore/detail/archive-page/gcaim...

              • lcnPylGDnU4H9OF 717 days ago
                I wasn't exactly confusing them but yeah, I did link to an archive.org article. I was having difficulty finding something specific to archive.is.

                I think the distinction between the two is moot in this post. The question could very well have been "How does archive.org bypass paywalls?" Though it's interesting that archive.is seems to just crawl the URL. Indeed that means they wouldn't necessarily be able to bypass the paywall.

  • janejeon 718 days ago
    > If it identifies itself as archive.is, then other people could identify themselves the same way.

    Theoretically, they could just publish the list of IP ranges that canonically "belong" to archive.is. That would allow websites to distinguish whether a request identifying itself as archive.is is actually from them (it fits one of the IP ranges) or is from an impostor.
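
    A sketch of what a site-side check could look like, assuming archive.is published such a list (the ranges below are made-up documentation addresses, not real ones):

        import ipaddress

        # Hypothetical published ranges; archive.is does not actually publish these.
        ARCHIVE_RANGES = [ipaddress.ip_network(n) for n in ("203.0.113.0/24", "2001:db8::/32")]

        def claims_are_consistent(remote_addr: str, user_agent: str) -> bool:
            """Accept the archive.is UA only if the request comes from a published range."""
            if "archive.is" not in user_agent:
                return True  # not claiming to be archive.is; nothing to verify
            ip = ipaddress.ip_address(remote_addr)
            return any(ip in net for net in ARCHIVE_RANGES)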

    • lazzlazzlazz 718 days ago
      It would be far better and more secure for archive.is to publish a public key on its site and then sign requests from its private key, which sites could optionally verify.
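
      A sketch of that scheme using Ed25519 via the cryptography package (the header names and the exact signed string are assumptions, not an existing protocol):

          import base64
          import time
          from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

          # Crawler side: sign the URL plus a timestamp and send it as a header.
          private_key = Ed25519PrivateKey.generate()   # in reality, a long-lived key
          public_key = private_key.public_key()        # published on the crawler's site

          url = "https://example-newspaper.com/some-article"  # placeholder
          message = f"{url}|{int(time.time())}".encode()
          signature = base64.b64encode(private_key.sign(message)).decode()
          # e.g. send headers X-Archive-Message / X-Archive-Signature (hypothetical names)

          # Site side: verify against the published key; raises InvalidSignature on forgery.
          public_key.verify(base64.b64decode(signature), message)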
      • sublinear 717 days ago
        You just described client certificate auth
      • facile 718 days ago
        +1 on this!
    • flerovium 718 days ago
      In theory, this might work. But is it true? Do lots of sites have an archive.is whitelist?
      • arbitrage 717 days ago
        I really don't see why they would, if they're using a paywall in the first place.
      • w1nst0nsm1th 717 days ago
        Follow the magnolia trail...
  • armchairhacker 717 days ago
    A lot of sites don't seem to care about their paywall. Plenty of them load the full article, then "block" me from seeing the rest by adding a popup and `overflow: hidden` to `body`, which is super easy to bypass with devtools. Others give you "free articles" via a cookie or localStorage, which you can of course remove to get more free articles.

    There are readers who will see a paywall and then pay, and there are readers who will try to bypass it or simply not read at all. Articles spread through social media attention, and a paywalled article gets much less of it, so it's non-negligibly beneficial to have people read the article for free who would otherwise not read it at all.

    Which is to say: the methods archive.is uses may not be that special. Clear cookies, block JavaScript, and make deals with or special-case the few sites which actually enforce their paywalls. Or identify yourself as archive.is, and if others do that to bypass the paywall, good for them.

  • alex_young 717 days ago
    Not specifically related to archive.is, but news sites have a tightrope to walk.

    They need to allow the full content of their articles to be accessed by crawlers so they can show up in search results, but they also want to restrict access via paywalls. They use two main methods to achieve this: JavaScript DOM manipulation and IP address rate limiting.

    Conceivably one could build a system which directly accesses a given document one time from a unique IP address and then cache the HTML version of the page for further serving.
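
    A bare-bones sketch of that caching layer (the per-request IP rotation is only hinted at in a comment; proxy handling itself is left out):

        import urllib.request

        _cache: dict[str, str] = {}

        def archived_html(url: str) -> str:
            """Fetch a URL at most once and serve the stored copy afterwards."""
            if url not in _cache:
                # In the scheme described above, this single fetch would go out
                # through a fresh IP (e.g. a rotating proxy) to stay under rate limits.
                req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
                with urllib.request.urlopen(req) as resp:
                    _cache[url] = resp.read().decode("utf-8", errors="replace")
            return _cache[url]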

  • retrocryptid 718 days ago
    Many (most?) "big content" sites let Google and Bing spiders scrape the contents of articles so that when people search for terms in the article they'll find a hit and then get referred to the paywall.

    Google doesn't want everyone to know what a Google indexing request looks like for fear the SEO mafia will institute shenanigans. And the content providers (NYT, WaPo, etc.) don't want people to know 'cause they don't want people evading their paywall.

    Or maybe they're okay with letting the archive index their content...

  • w1nst0nsm1th 717 days ago
    If the people who know that told you, they could lose access to said resources.

    But it's kind of an open secret; you're just not looking in the right place.

  • thallosaurus 716 days ago
    I just tried it with a local newspaper. It did remove the floating pane but didn't unblur the text, which is also scrambled (it used to be far less well protected; Firefox reader mode could easily bypass it)

    (https://archive.is/1h4UV)

  • xiekomb 718 days ago
    I thought they used this browser extension: https://gitlab.com/magnolia1234/bypass-paywalls-chrome-clean
    • flerovium 718 days ago
      That extension does work, but do we know they use it?
      • marcod 717 days ago
        They don't always use it, because I can archive a new page from my mobile phone browser, which doesn't even support extensions.

        My guess is that most content providers with paywalls serve the entire content, so search engines can pick it up, and then use scripts to raise the paywall - archive.is takes their snapshot before that happens / doesn't trigger those scripts.

    • DrDentz 717 days ago
      It's actually the opposite: for some news sites this extension links to archive.is because that's the only known way to bypass the paywall.
      • nora-puchreiner 717 days ago
        There are known ways to bypass paywalls which are simply impossible to implement within a browser extension but trivial for 12ft or archive.is. For example, using a Ukrainian residential proxy, since some news websites have granted free access from Ukraine.
  • jrochkind1 717 days ago
    Every once in a while I _do_ get a retrieval from archive.is that has the paywall intact.

    But I don't know the answer either.

  • Yujf 717 days ago
    I don't know about archive.is, but 12ft.io does identify as Google to bypass paywalls afaik
    • strunz 717 days ago
      12ft.io also doesn't work or is disabled for many sites that archive.is still works on
      • hda111 717 days ago
        Maybe because the creator of 12ft.io isn't anonymous
    • janejeon 717 days ago
      Wouldn't sites be able to see that requests from 12ft.io aren't coming from Google's IPs?
      • dpifke 717 days ago
        Yes.

        Google recommends using reverse DNS to verify whether a visitor claiming to be Googlebot is legitimate or not: https://developers.google.com/search/docs/crawling-indexing/...

        You can also verify IP ownership using WHOIS, or by examining BGP routing tables to see which ASN is announcing the IP range. Google also publishes their IP address ranges here: https://www.gstatic.com/ipranges/goog.json

        • nora-puchreiner 717 days ago
          https://search.google.com/test/rich-results?url= operates from legit Googlebot IPs, so it allows anyone to get paywalled content that even archive.is fails to fetch (from theinformation.com, for example)
        • rahimnathwani 717 days ago
          "Google recommends using reverse DNS to verify..."

          This is almost right. They recommend two steps:

          1. Use reverse DNS to find the hostname the IP address claims to have. (The IP address block owner can put any hostname in here, even if they don't own/control the domain.)

          2. Assuming the claimed hostname is on one of Google's domains, do a forward DNS lookup to verify that the original IP address is returned.

          The second step is the important one.
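
          A sketch of that two-step check in Python (standard library only; error handling kept minimal, and the Googlebot domains are the ones Google documents):

              import socket

              def is_real_googlebot(ip: str) -> bool:
                  """Reverse-resolve the IP, check the domain, then forward-resolve to confirm."""
                  try:
                      hostname, _, _ = socket.gethostbyaddr(ip)            # step 1: reverse DNS
                  except socket.herror:
                      return False
                  if not hostname.endswith((".googlebot.com", ".google.com")):
                      return False
                  try:
                      forward_ips = socket.gethostbyname_ex(hostname)[2]   # step 2: forward DNS
                  except socket.gaierror:
                      return False
                  return ip in forward_ips  # the important step: the name must map back to the IP

              # e.g. is_real_googlebot("66.249.66.1")  # an address in a published Googlebot range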

  • firexcy 716 days ago
    My hypothesis is that they use a set of generic methods (e.g., robot UA, transient cache, and JS filtering) and rely on user reports (they have a Tumblr page for that) to identify and manually fix access to specific sites. Having a look at the source of the Bypass Paywalls Clean extension will give you a good idea of the most useful bypassing methods. Indeed, most publishers are only incentivized to paywall their content to the degree that most of their audience is directed to pay, and they have to leave backdoors here and there for purposes such as SEO.
  • w1nst0nsm1th 717 days ago
    Follow the magnolia trail...
  • shipscode 717 days ago
    What happens when you first load a paywalled article? 9 times out of 10 it shows the entire article before the JS that runs on the page pops up the paywall. Seems like it probably just takes a snapshot prior to JS paywall execution combined with the Google referrer trick or something along those lines.
  • riffic 717 days ago
    your browser usually downloads the entire article and certain elements are overlaid on top.

    it's trivial to bypass most paywalls, isn't it?

    • aidenn0 717 days ago
      Not for some (I think the Wall Street Journal). Apparently the AMP version of the page does work this way for WSJ though, which is how IA gets around the paywall.
  • aaron695 717 days ago
    [dead]
  • gregjor 718 days ago
    [flagged]
  • not_your_vase 717 days ago
    They use you as a proxy. If you (the person archiving it) have access to the site (either because you paid or have free articles left), they can archive it too. If you don't have access, they only archive the paywall.
  • mr-pink 718 days ago
    every time you visit they force some kid in a third world country to answer captchas until they can pay for one article's worth of content
  • jwildeboer 718 days ago
    It’s internet magic. <rainbowmagicsparkles.gif> ;)
  • jakedata 717 days ago
    Alas, it doesn't allow access to the comment section of the WSJ, which is the only reason I would visit the site. WSJ comments reinforce my opinion of the majority of humanity. My father allowed his subscription to lapse and I won't send them my money, so I will just have to imagine it.