Pretty cool, but I recommend that anyone wanting to do this kind of thing check out the source Puppeteer library. You can do some really powerful stuff and build a custom crawler fairly easily.
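For a sense of what "fairly easily" means: the core of a custom crawler is just a queue plus a visited set. A minimal sketch, with the Puppeteer-backed part injected (`fetchLinks` is a hypothetical name; in real use it would wrap `page.goto` plus something like `page.$$eval('a', as => as.map(a => a.href))`):

```javascript
// Breadth-first crawl loop. `fetchLinks(url)` returns the outgoing links of
// a page; injecting it keeps the queue/dedup logic independent of Puppeteer.
async function crawl(startUrl, fetchLinks, maxPages = 10) {
  const seen = new Set([startUrl]); // URLs already queued, to avoid revisits
  const queue = [startUrl];
  const visited = [];
  while (queue.length > 0 && visited.length < maxPages) {
    const url = queue.shift();
    visited.push(url);
    for (const link of await fetchLinks(url)) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return visited;
}
```

Everything else (politeness delays, robots.txt, per-domain limits) hangs off that loop.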
Unless something has changed that I missed, you can install extensions (I complained when the default args messed this up [0]). For example, I built something that uses puppeteer and an extension to capture audio and video of a tab [1]. It's just headless mode that doesn't allow extensions [2] (which I now realize is probably what you meant).
Puppeteer seems needlessly difficult to use on a VPS. I'd prefer an easily dockerized version, but there seems to be nothing robust, and sadly they make it VERY hard to connect to a Docker instance just running Chrome over the WebSocket/9222 interface.
I've done this recently actually. Take a look at the yukinying/chrome-headless-browser[0] image. You'll need to run with the SYS_ADMIN capability and up the shm_size to 1024M (you can work around the SYS_ADMIN cap with a seccomp file, but I didn't have much luck with that). Other than that oddness it works pretty well (and with Puppeteer 1.0, with far fewer crashes).
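Concretely, the run command described above looks something like this (image name from the linked repo; publishing 9222 is an assumption, that being Chrome's usual DevTools port):

```shell
docker run -d \
  --cap-add=SYS_ADMIN \
  --shm-size=1024m \
  -p 9222:9222 \
  yukinying/chrome-headless-browser
```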
This has been possible for a long time with any browser using Selenium for example. It has APIs and client libraries for many languages.
Also, using a real browser brings a lot of problems: high resource consumption, hangs, it being unclear when the page has finished loading, etc. You have to supervise all browser processes. And if you use promises, there is a high chance that you will miss error messages, because promises hide them by default.
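The promise point bites a lot of people. A minimal Node sketch of the two defenses (`fetchTitle` and `loadPage` are made-up names; `loadPage` stands in for whatever browser call might fail):

```javascript
// By default an unhandled promise rejection is easy to miss; registering a
// process-level handler makes those errors loud instead of silent.
process.on('unhandledRejection', (reason) => {
  console.error('Unhandled rejection:', reason);
  process.exit(1); // crash loudly rather than silently losing the error
});

// The safer pattern: handle errors at every await site.
async function fetchTitle(loadPage) {
  try {
    const page = await loadPage();
    return page.title;
  } catch (err) {
    return `error: ${err.message}`; // the error is handled, not swallowed
  }
}
```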
While as a developer I find this super interesting, as a system administrator this makes me cringe. We don't have a lot of resources for servers, and I end up spending a disproportionate amount of time banning IPs from bots running poorly configured tools like this, which aren't rate limited and crush websites.
I'm grateful that "Obey robots.txt" is listed as part of its standard behavior. If only scrapers cared enough to use it as well.
I've found that mod_evasive[1] works particularly well in these situations and helped us a lot in dealing with it (though I'm not a sysadmin, and I'm sure there are better tools for the job). For someone who is just a webmaster, I'd recommend it as a quick-and-dirty fix for such hassles.
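For reference, mod_evasive's knobs are a handful of Apache directives; the values below are illustrative defaults, not tuned recommendations:

```apache
<IfModule mod_evasive20.c>
    DOSHashTableSize    3097
    # max requests for the same page per interval before an IP is blocked
    DOSPageCount        5
    DOSPageInterval     1
    # max requests for any object on the site by the same IP per interval
    DOSSiteCount        50
    DOSSiteInterval     1
    # how long (seconds) an offending IP stays blocked
    DOSBlockingPeriod   60
</IfModule>
```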
Such a crawler should not be difficult to ban by looking at stats: if there are many requests per IP per unit of time, many requests from data-center IPs, or many requests from Linux browsers, they are likely bots and you can ban them (you can ban the whole data center to be sure).
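The per-IP-per-unit-of-time check is a few lines anywhere in the stack. A sketch (`makeRateChecker` is a made-up helper, and the thresholds are made-up examples; a real setup would also key on data-center IP ranges):

```javascript
// Naive sliding-window rate check: flag an IP that makes more than `limit`
// requests in the last `windowMs` milliseconds.
function makeRateChecker(limit = 100, windowMs = 60000) {
  const hits = new Map(); // ip -> timestamps of recent requests
  return function shouldBan(ip, now = Date.now()) {
    // keep only timestamps inside the window, then record this request
    const recent = (hits.get(ip) || []).filter((t) => now - t < windowMs);
    recent.push(now);
    hits.set(ip, recent);
    return recent.length > limit;
  };
}
```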
There are a lot of folks reevaluating their crawling engines lately now that Chrome headless is maturing. To me there are some important considerations in terms of CPU/memory footprint that go into distributing a large headless crawling architecture.
The stuff we are not seeing open-sourced is the solutions companies are building around trimmed-down, specialized versions of headless browsers like headless Chrome, Servo, and WebKit. People are running distributed versions of these headless browsers using Apache Mesos, Kubernetes, and Kafka queues.
I was stuck last time I used headless Chrome, when I needed to use a proxy with a username and a password. Headless Chrome just doesn't support it. Any changes on that?
Thanks. Wish it were simpler. It seems overkill to have to run an extra proxy with no auth in the middle, just so it can authenticate with the one that has auth, to get headless Chrome working.
I've been considering writing my own puppeteer docker image such that one could freeze the image at crawl time after a page has loaded. This would allow me to re-write the page-parsing logic after the page layout changes. Has anyone done this already or know of any other efforts to serialize the puppeteer page object to handle parsing bugs?
In large-scale scraping they always separate loading pages from processing the data. The easiest thing would be to wait for the page to load and then save the HTML.
Yes, I've used Puppeteer, but I couldn't see where it exposed the page object. Looking closer, I saw:
HCCrawler.launch({
  // Function to be evaluated in browsers
  evaluatePage: (() => ({
    title: $('title').text(),
  })),
  // Function to be called with evaluated results from browsers
  onSuccess: (result => {
    console.log(result);
  }),
})
which doesn't look that good for my other worry: you come to a page whose DOM is built up dynamically with JS, load the DOM, evaluate it within milliseconds, and then that DOM changes after you've already done your (now incorrect) evaluation.
So if you have a page that loads with 4 links defined, runs its script, and then ends up with 100+ links, you miss the 100+ links. I notice people often fail to account for this in their crawlers, so I wondered whether this one does.
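One pragmatic way to account for it is to not evaluate until the thing you're counting stops changing. A sketch of a generic quiet-period poller (`waitForStable` is a made-up helper; in Puppeteer you'd feed it something like `page.$$eval('a', as => as.length)`, or reach for the built-in `page.waitForSelector` / `waitUntil: 'networkidle0'` instead):

```javascript
// Repeatedly sample `getCount` until it returns the same value for
// `quietPeriods` consecutive samples, then report the stable value.
async function waitForStable(getCount, { interval = 50, quietPeriods = 3 } = {}) {
  let last = await getCount();
  let quiet = 0;
  while (quiet < quietPeriods) {
    await new Promise((r) => setTimeout(r, interval));
    const now = await getCount();
    if (now === last) {
      quiet += 1; // unchanged: one more quiet sample
    } else {
      quiet = 0; // still growing: reset and keep waiting
      last = now;
    }
  }
  return last;
}
```

A timeout around this is advisable in practice, since some pages mutate forever.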
https://github.com/GoogleChrome/puppeteer
I haven’t looked into it, but I imagine it has a pretty clear fingerprint as well. So it would be easier to block than stock chrome in headless mode.
[0] https://github.com/GoogleChrome/puppeteer/issues/850
[1] https://github.com/cretz/chrome-screen-rec-poc/tree/master/a...
[2] https://bugs.chromium.org/p/chromium/issues/detail?id=706008
Let me quickly add instructions here. First, you need to install some dependencies; add the following to your Dockerfile:
Secondly, launch Puppeteer with the --no-sandbox option. That should do it.

[0]: https://github.com/yukinying/chrome-headless-browser-docker
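In code, that second step is just the `args` option to `puppeteer.launch`. A sketch (`dockerLaunchArgs` is a made-up helper; `--disable-dev-shm-usage` is an extra flag often paired with `--no-sandbox` in containers):

```javascript
// Chrome flags commonly needed when Puppeteer runs inside a container:
// --no-sandbox because Chrome's sandbox can't start under the default
// container user, --disable-dev-shm-usage because /dev/shm defaults to 64MB.
function dockerLaunchArgs(extra = []) {
  return ['--no-sandbox', '--disable-dev-shm-usage', ...extra];
}

// Real use would look like:
//   const browser = await puppeteer.launch({ args: dockerLaunchArgs() });
```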
https://github.com/webfolderio/cdp4j
[1] https://www.digitalocean.com/community/tutorials/how-to-prot...
It's around 200k and very easy to configure.
https://hub.docker.com/r/massimo/cntlm/
Yeah, you just call page.authenticate({ username, password }) before navigating.
Wget is also a pretty robust crawler, but people have requested a proxy that archives every site they visit in real-time more than a crawler.
Basically, when a new-page event happens, you get the `page` object, so you have access to it and can run queries against it.
https://github.com/GoogleChrome/puppeteer/blob/v1.1.0/docs/a...