May 12, 2007 at 11:47 am (HtmlUnit, Java)
Robert Martin has some constructive criticism for HtmlUnit:
I’m using HtmlUnit to parse and interpret HTML web pages. I’ve been very impressed with this library so far. And I appreciate the hard work and dedication of people who give their software away for free. So, although this blog is a complaint, it should not be misconstrued into anything more than constructive criticism.
What I want to do with HtmlUnit is quite simple. Given a string containing HTML, I’d like to query that HTML for certain tags and attributes… Sweet, simple, uncomplicated. Just create the DOM from an HTML String, and then query that DOM. Unfortunately, HtmlUnit does not appear to be that simple.
Given my simple needs, why do I care about WebClient and Window? Why do I have to turn off the JavaScript engine? It may seem a small thing, but it bothers me nonetheless. It’s the principle of the matter that gets under my skin. The pragmatic programmers called it The Principle of Least Surprise. I call it, simply, dependency management. Don’t make people depend on more than they need.
It does indeed look like Robert had to do some digging in order to bend HtmlUnit to his will. However, I have to say that I’ve never seen anyone else use the library like he has, as a glorified HTML parser. HtmlUnit’s raison d’être is to test web applications — it is essentially a headless browser. Some people use it for screen scraping, because it can handle some pretty ugly HTML and JavaScript. But the further away from its intended use you get, the more trouble you are going to run into.
Take, for example, the complaint that he had to disable JavaScript processing. If we disabled JavaScript processing by default, 95% of our users would take to the streets howling for our blood, and I think I speak for all of us when I say that none of the HtmlUnit devs have a death wish.
Robert’s central complaint is that it’s hard to feed HtmlUnit a String and get back an HtmlPage. The problem is that WebClient, the central HtmlUnit class, does in fact have a getPage(String) method, but it’s geared to the vast majority of users who want to be able to call getPage("http://my.testable.site"), rather than getPage("<html><head><title>My Testable Site</title>…</html>").
Robert claims that HtmlUnit violates the principle of least surprise, i.e. “do the least surprising thing.” Surprise is a subjective beast, but I would submit that catering to the 95% of users who use HtmlUnit as it was intended is definitely the least surprising route to take. That’s not to say that we can’t accommodate alternate uses (perhaps by adding a getPageForHtml(String) to WebClient), but not at the expense of our core competency.
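To make the naming problem concrete, here is a rough sketch of why the HTML variant would need its own method name. Java cannot overload two methods that both take a single String, so a URL-based getPage(String) and an HTML-based one cannot coexist under the same name. SketchClient, SketchPage and SketchDemo are hypothetical stand-ins invented for this illustration; they are not HtmlUnit classes, and getPageForHtml is only the idea floated above, not an existing method.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical stand-in for a loaded page; not an HtmlUnit class. */
final class SketchPage {
    final String source;   // "url" or "html": how the page was requested
    final String content;  // the address or the markup that was passed in

    SketchPage(String source, String content) {
        this.source = source;
        this.content = content;
    }
}

/** Hypothetical client illustrating the naming problem; not HtmlUnit's WebClient. */
final class SketchClient {

    /** Treats the argument as an address, which is what most callers expect. */
    SketchPage getPage(String url) {
        try {
            URI uri = new URI(url);
            if (uri.getScheme() == null) {
                throw new IllegalArgumentException("Not an absolute URL: " + url);
            }
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a URL: " + url, e);
        }
        // A real client would fetch the page here; the sketch only records the address.
        return new SketchPage("url", url);
    }

    /**
     * Treats the argument as markup. A distinct name is required because
     * a second getPage(String) would clash with the method above.
     */
    SketchPage getPageForHtml(String html) {
        // A real client would parse the markup here; the sketch only records it.
        return new SketchPage("html", html);
    }
}

class SketchDemo {
    public static void main(String[] args) {
        SketchClient client = new SketchClient();
        SketchPage byUrl = client.getPage("http://my.testable.site");
        SketchPage byHtml = client.getPageForHtml("<html><body><p>hi</p></body></html>");
        System.out.println(byUrl.source + " / " + byHtml.source);
    }
}
```

In this sketch, passing markup to getPage fails fast with an IllegalArgumentException instead of being silently treated as an address, which is one way to keep the common case unsurprising while still offering the alternate entry point.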
May 5, 2007 at 10:49 pm (HtmlUnit, Java)
As an HtmlUnit committer, every month or so I spend about half an hour searching the web for new mentions of the project, just to get an idea of the latest buzz, unreported problems, unsung praises, etc. During the latest of these searches, I ended up at ohloh, an online directory for open source software projects.
The site lists a large number of projects and reports interesting metrics for each of them, such as codebase size, the estimated effort and cost necessary to reproduce the codebase, level of documentation, related projects, and user ratings and reviews. Here are some of the metrics, as of today, for HtmlUnit and some related projects:
Both jWebUnit and Canoo WebTest are based on HtmlUnit, using it internally to do the HTML / JavaScript heavy lifting. However, jWebUnit appears to be a thin wrapper, while Canoo WebTest is larger than HtmlUnit itself! HttpUnit is somewhat smaller than HtmlUnit, which makes sense: HtmlUnit provides a higher level of abstraction and supports many more JavaScript constructs than HttpUnit does.
It’ll be interesting to see if this site takes off or not. Developers often have to pull information from a wide array of sources in order to make informed decisions as to the libraries that they need to include in their stack. While ohloh provides some of this information, there is one glaring omission: community size and growth trends. This is usually measured via mailing list activity — lots of posts to the mailing lists imply a large user community. It would be nice to have this information listed as well, and not just as a factoid in the summary section.
Regardless of this small omission, ohloh is a fun site to browse. It’s interesting to see their take on the various software projects out there, and it’s entertaining to compare competing libraries (SVN vs CVS!). If lots of people actually start using it to recommend and critique software, it might become an extremely useful website.
May 1, 2007 at 9:56 pm (Java, Tapestry)
Well, it’s official: I can now add features to Tapestry without a middleman! If Howard hadn’t beaten me to it, I’d probably start out by adding a BeanForm component to Tapestry 5. As things stand, I’ll probably start small: fix some bugs, add some small features, etc.
This also increases the pressure to get HtmlUnit working with the Prototype (Tapestry 5) and Dojo (Tapestry 4) JavaScript libraries. There’s no excuse for my two pet projects to remain immiscible. Well… except that it’s hard to simulate all of the extravagant browser features that these newfangled JavaScript libraries require.
I enjoy writing (to a degree), so one of my self-imposed tasks may also be to polish the documentation for Tapestry 5. One of the more prominent criticisms of Tapestry 4 has been that the learning curve is steep and the documentation is subpar. It’d be nice to turn that around and move Tapestry 5 to the opposite side of the spectrum: a gentle learning curve facilitated by excellent documentation. We’ll see how it goes…