I’ve been using HtmlUnit to crawl the web for the past couple of weeks. This interesting experience has led to two new features:
First, I’ve added an insecure SSL handler which trusts anyone and everyone. Why? Because websites often have misconfigured or expired SSL certificates, and the standard Java behavior is to abort the connection with an exception when this happens. Not very nice. So now you can call setUseInsecureSSL(true) on your WebClient instead and continue crawling, happily oblivious to the webmaster’s incompetence.
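For the curious, here is roughly what a trust-everyone handler looks like in plain JSSE. Treat it as a minimal sketch of the general technique, not necessarily how HtmlUnit does it internally; the names InsecureSslSketch, TrustAllManager and insecureSocketFactory are made up for illustration.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical illustration class; not part of HtmlUnit.
final class InsecureSslSketch {

    /** Trust manager that accepts every certificate chain without checking it. */
    public static final class TrustAllManager implements X509TrustManager {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // Intentionally empty: never rejects a client certificate.
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // Intentionally empty: expired, self-signed, or otherwise
            // untrusted server certificates are all accepted.
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            // No particular issuers are advertised.
            return new X509Certificate[0];
        }
    }

    /** Builds a socket factory whose TLS context uses the trust-all manager. */
    public static SSLSocketFactory insecureSocketFactory() throws GeneralSecurityException {
        SSLContext context = SSLContext.getInstance("TLS");
        // null key managers and null SecureRandom fall back to the JVM defaults.
        context.init(null, new TrustManager[] { new TrustAllManager() }, null);
        return context.getSocketFactory();
    }

    private InsecureSslSketch() {
    }
}
```

A socket factory like this only switches off certificate-chain checks; hostname verification is a separate step in Java. Since it really does trust everyone, it only makes sense for something like a crawler, where you don't care who is on the other end.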
Second, I’ve added a popup blocker. Lots of sites send a bunch of popups your way, and even though they’re not quite as annoying when you’re using a headless browser like HtmlUnit, they still waste time and bandwidth. So now you can call setPopupBlockerEnabled(true) on your WebClient, and your crawler will be that much faster.
These features will be available in HtmlUnit 1.14, or you can just grab the latest snapshot build here. Enjoy!