Robert Martin has some constructive criticism for HtmlUnit:
I’m using HtmlUnit to parse and interpret HTML web pages. I’ve been very impressed with this library so far. And I appreciate the hard work and dedication of people who give their software away for free. So, although this blog is a complaint, it should not be misconstrued into anything more than constructive criticism.
What I want to do with HtmlUnit is quite simple. Given a string containing HTML, I’d like to query that HTML for certain tags and attributes… Sweet, simple, uncomplicated. Just create the DOM from an HTML String, and then query that DOM. Unfortunately, HtmlUnit does not appear to be that simple.
Given my simple needs, why do I care about WebClient and Window? Why do I have to turn off the JavaScript engine? It may seem a small thing, but it bothers me nonetheless. It’s the principle of the matter that gets under my skin. The pragmatic programmers called it The Principle of Least Surprise. I call it, simply, dependency management. Don’t make people depend on more than they need.
It does indeed look like Robert had to do some digging in order to bend HtmlUnit to his will. However, I have to say that I’ve never seen anyone else use the library like he has, as a glorified HTML parser. HtmlUnit’s raison d’être is to test web applications — it is essentially a headless browser. Some people use it for screen scraping, because it can handle some pretty ugly HTML and JavaScript. But the further away from its intended use you get, the more trouble you are going to run into.
Take, for example, the complaint that he had to disable JavaScript processing. If we disabled JavaScript processing by default, 95% of our users would take to the streets howling for our blood, and I think I speak for all of us when I say that none of the HtmlUnit devs have a death wish.
Robert’s central complaint is that it’s hard to feed HtmlUnit a String and get back an HtmlPage. The problem is that WebClient, the central HtmlUnit class, does in fact have a getPage(String) method, but it’s geared to the vast majority of users who want to be able to call getPage("http://my.testable.site"), rather than getPage("<html><head><title>My Testable Site</title>…</html>").
Robert claims that HtmlUnit violates the principle of least surprise, i.e. “do the least surprising thing.” Surprise is a subjective beast, but I would submit that catering to the 95% of users who use HtmlUnit as it was intended is definitely the least surprising route to take. That’s not to say that we can’t accommodate alternate uses (perhaps by adding a getPageForHtml(String) to WebClient), but not at the expense of our core competency.
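To make the distinction concrete, here is a minimal sketch in plain Java of why one String parameter cannot comfortably serve both callers. It is not HtmlUnit code: the class PageSourceSketch, its isUrl helper, and the return values are all hypothetical stand-ins, and the getPageForHtml shown here only mirrors the name floated above, not any existing method.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch, not HtmlUnit code: it only shows why a String argument
// that is expected to be a URL cannot double as raw markup.
class PageSourceSketch {

    // Returns true when the text parses as an absolute URI,
    // for example "http://my.testable.site".
    static boolean isUrl(String text) {
        try {
            return new URI(text).isAbsolute();
        } catch (URISyntaxException e) {
            // Markup such as "<html>..." contains characters that are illegal in a URI.
            return false;
        }
    }

    // Stand-in for a URL-oriented getPage(String): rejects anything that is not a URL.
    static String getPage(String url) {
        if (!isUrl(url)) {
            throw new IllegalArgumentException("Not a URL: " + url);
        }
        return "fetch:" + url;
    }

    // Stand-in for the proposed getPageForHtml(String): treats the argument as markup.
    static String getPageForHtml(String html) {
        return "parse:" + html;
    }
}
```

In this sketch, passing markup to the URL-oriented method fails fast, while the separately named method accepts it. That is the design choice a dedicated getPageForHtml(String) would make explicit: the method name, rather than guesswork about the String's contents, tells the library what kind of input it received.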
jack said,
April 21, 2008 at 6:15 pm
Interesting, I came across the same issue (trying to convert raw HTML into an HtmlPage). A potential scenario is when you have logged raw HTML files and are trying to do simple tests on them (e.g. make sure the right divs, buttons, etc. are on the HTML page).
Let me add that HtmlUnit is awesome.
You guys rock! I am new to the world of open source, and it always astounds me what quality software has been created by developers in their free time. Philosophically speaking, it is a form of charity. Kudos to you.
Daniel Gredler said,
April 23, 2008 at 11:26 am
Hi Jack,
Thank you for your kind words.
If you do want a getPageForHtml(String) method added to WebClient (or any other feature) head over to the SF project page and add a feature request (http://sourceforge.net/projects/htmlunit, under Tracker).
Take care,
Daniel