Pretty cool, but I recommend that anyone wanting to do this kind of thing check out the source Puppeteer library. You can do some really powerful stuff and build a custom crawler fairly easily.
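For a sense of what "fairly easily" means: the core of a custom crawler is just a queue plus a visited set. A minimal sketch, with the Puppeteer-backed part injected (`fetchLinks` is a hypothetical name; in real use it would wrap `page.goto` plus something like `page.$$eval('a', as => as.map(a => a.href))`):

```javascript
// Breadth-first crawl loop. `fetchLinks(url)` returns the outgoing links of
// a page; injecting it keeps the queue/dedup logic independent of Puppeteer.
async function crawl(startUrl, fetchLinks, maxPages = 10) {
  const seen = new Set([startUrl]); // URLs already queued, to avoid revisits
  const queue = [startUrl];
  const visited = [];
  while (queue.length > 0 && visited.length < maxPages) {
    const url = queue.shift();
    visited.push(url);
    for (const link of await fetchLinks(url)) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return visited;
}
```

Everything else (politeness delays, robots.txt, per-domain limits) hangs off that loop.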
Unless something has changed that I missed, you can install extensions (I complained when the default args messed this up [0]). For example, I built something that uses puppeteer and an extension to capture audio and video of a tab [1]. It's just headless mode that doesn't allow extensions [2] (which I now realize is probably what you meant).
Puppeteer seems needlessly difficult to use on a VPS. I'd prefer an easily dockerized version, but there seems to be nothing robust, and sadly they make it VERY hard to connect to a Docker instance just running Chrome over the WebSocket/9222 interface.
I've done this recently actually. Take a look at the yukinying/chrome-headless-browser[0] image. You'll need to run with the SYS_ADMIN capability and up the shm_size to 1024M (you can work around the SYS_ADMIN cap with a seccomp file, but I didn't have much luck with that). Other than that oddness it works pretty well (and with Puppeteer 1.0, with far fewer crashes).
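Concretely, the run command described above looks something like this (image name from the linked repo; publishing 9222 is an assumption, that being Chrome's usual DevTools port):

```shell
docker run -d \
  --cap-add=SYS_ADMIN \
  --shm-size=1024m \
  -p 9222:9222 \
  yukinying/chrome-headless-browser
```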
This has been possible for a long time with any browser using Selenium for example. It has APIs and client libraries for many languages.
Also, using a real browser brings a lot of problems: high resource consumption, hangs, it being unclear when the page has finished loading, etc. You have to supervise all browser processes. And if you use promises, there is a high chance that you will miss error messages, because promises hide them by default.
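The promise point bites a lot of people. A minimal Node sketch of the two defenses (`fetchTitle` and `loadPage` are made-up names; `loadPage` stands in for whatever browser call might fail):

```javascript
// By default an unhandled promise rejection is easy to miss; registering a
// process-level handler makes those errors loud instead of silent.
process.on('unhandledRejection', (reason) => {
  console.error('Unhandled rejection:', reason);
  process.exit(1); // crash loudly rather than silently losing the error
});

// The safer pattern: handle errors at every await site.
async function fetchTitle(loadPage) {
  try {
    const page = await loadPage();
    return page.title;
  } catch (err) {
    return `error: ${err.message}`; // the error is handled, not swallowed
  }
}
```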
While as a developer I find this super interesting, as a system administrator this makes me cringe. We don't have a lot of resources for servers, and I end up spending a disproportionate amount of time banning IPs from bots running poorly configured tools like this, which aren't rate limited and crush websites.
I'm grateful that "Obey robots.txt" is listed as part of its standard behavior. If only scrapers cared enough to use it as well.
I've found that mod_evasive[1] works particularly well in these situations and helped us a lot in dealing with it (though I'm not a sysadmin, and I'm sure there are better tools for the job). For someone who is just a webmaster, I'd recommend it as a quick-and-dirty fix for such hassles.
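For reference, mod_evasive's knobs are a handful of Apache directives; the values below are illustrative defaults, not tuned recommendations:

```apache
<IfModule mod_evasive20.c>
    DOSHashTableSize    3097
    # max requests for the same page per interval before an IP is blocked
    DOSPageCount        5
    DOSPageInterval     1
    # max requests for any object on the site by the same IP per interval
    DOSSiteCount        50
    DOSSiteInterval     1
    # how long (seconds) an offending IP stays blocked
    DOSBlockingPeriod   60
</IfModule>
```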
Such a crawler should not be difficult to ban by looking at stats: if there are many requests per IP per unit of time, many requests from data-center IPs, or many requests from Linux browsers, they are likely bots and you can ban them (you can ban the whole data center to be sure).
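The per-IP-per-unit-of-time check is a few lines anywhere in the stack. A sketch (`makeRateChecker` is a made-up helper, and the thresholds are made-up examples; a real setup would also key on data-center IP ranges):

```javascript
// Naive sliding-window rate check: flag an IP that makes more than `limit`
// requests in the last `windowMs` milliseconds.
function makeRateChecker(limit = 100, windowMs = 60000) {
  const hits = new Map(); // ip -> timestamps of recent requests
  return function shouldBan(ip, now = Date.now()) {
    // keep only timestamps inside the window, then record this request
    const recent = (hits.get(ip) || []).filter((t) => now - t < windowMs);
    recent.push(now);
    hits.set(ip, recent);
    return recent.length > limit;
  };
}
```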
There are a lot of folks reevaluating their crawling engines lately now that Chrome headless is maturing. To me there are some important considerations in terms of CPU/memory footprint that go into distributing a large headless crawling architecture.
The stuff we are not seeing open-sourced is the solutions companies are building around trimmed-down, specialized versions of headless browsers like headless Chrome, Servo, and WebKit. People are running distributed versions of these headless browsers using Apache Mesos, Kubernetes, and Kafka queues.
I was stuck last time I used headless Chrome, when I needed to use a proxy with a username and a password. Headless Chrome just doesn't support it. Any changes on that?
Thanks. Wish it were simpler. It seems overkill to have to run an extra proxy with no auth in the middle, just so it can authenticate with the one that has auth, to get headless Chrome working.
I've been considering writing my own puppeteer docker image such that one could freeze the image at crawl time after a page has loaded. This would allow me to re-write the page-parsing logic after the page layout changes. Has anyone done this already or know of any other efforts to serialize the puppeteer page object to handle parsing bugs?
In large-scale scraping they always separate loading pages from processing the data. The easiest thing would be to wait for the page to load and then save the HTML.
Yes, I've used Puppeteer, but I couldn't see where it exposed the page object. Looking closer, I saw:
HCCrawler.launch({
  // Function to be evaluated in browsers
  evaluatePage: (() => ({
    title: $('title').text(),
  })),
  // Function to be called with evaluated results from browsers
  onSuccess: (result => {
    console.log(result);
  }),
})
which doesn't look that good for my other worry: you come to a page whose DOM is built up dynamically with JS, load the DOM, evaluate it within milliseconds, and then that DOM changes after you've already done your (now incorrect) evaluation.
So if you have a page that loads with 4 links defined, runs its script, and then ends up with 100+ links, you miss the 100+ links. I notice people often fail to account for this in their crawlers, so I wondered whether this one does.
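One pragmatic way to account for it is to not evaluate until the thing you're counting stops changing. A sketch of a generic quiet-period poller (`waitForStable` is a made-up helper; in Puppeteer you'd feed it something like `page.$$eval('a', as => as.length)`, or reach for the built-in `page.waitForSelector` / `waitUntil: 'networkidle0'` instead):

```javascript
// Repeatedly sample `getCount` until it returns the same value for
// `quietPeriods` consecutive samples, then report the stable value.
async function waitForStable(getCount, { interval = 50, quietPeriods = 3 } = {}) {
  let last = await getCount();
  let quiet = 0;
  while (quiet < quietPeriods) {
    await new Promise((r) => setTimeout(r, interval));
    const now = await getCount();
    if (now === last) {
      quiet += 1; // unchanged: one more quiet sample
    } else {
      quiet = 0; // still growing: reset and keep waiting
      last = now;
    }
  }
  return last;
}
```

A timeout around this is advisable in practice, since some pages mutate forever.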
https://github.com/GoogleChrome/puppeteer
I haven’t looked into it, but I imagine it has a pretty clear fingerprint as well. So it would be easier to block than stock chrome in headless mode.
[0] https://github.com/GoogleChrome/puppeteer/issues/850
[1] https://github.com/cretz/chrome-screen-rec-poc/tree/master/a...
[2] https://bugs.chromium.org/p/chromium/issues/detail?id=706008
Let me quickly add instructions here. First, you need to install some dependencies; add the following to your Dockerfile:
Secondly, launch Puppeteer with the --no-sandbox option. That should do it.

[0]: https://github.com/yukinying/chrome-headless-browser-docker
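In code, that second step is just the `args` option to `puppeteer.launch`. A sketch (`dockerLaunchArgs` is a made-up helper; `--disable-dev-shm-usage` is an extra flag often paired with `--no-sandbox` in containers):

```javascript
// Chrome flags commonly needed when Puppeteer runs inside a container:
// --no-sandbox because Chrome's sandbox can't start under the default
// container user, --disable-dev-shm-usage because /dev/shm defaults to 64MB.
function dockerLaunchArgs(extra = []) {
  return ['--no-sandbox', '--disable-dev-shm-usage', ...extra];
}

// Real use would look like:
//   const browser = await puppeteer.launch({ args: dockerLaunchArgs() });
```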
https://github.com/webfolderio/cdp4j
[1] https://www.digitalocean.com/community/tutorials/how-to-prot...
It's around 200k and very easy to configure.
https://hub.docker.com/r/massimo/cntlm/
Yeah, you just call page.authenticate({ username, password }) before navigating.
Wget is also a pretty robust crawler, but people have requested a proxy that archives every site they visit in real-time more than a crawler.
Basically, when a new-page event happens, you get the `page` object, so you have access to it and can run queries against it.
https://github.com/GoogleChrome/puppeteer/blob/v1.1.0/docs/a...