URL.hashCode() Considered Harmful

I just cut HtmlUnit’s build time by about 20% by changing four lines of code. How? HtmlUnit keeps a small cache of web requests in a HashMap, keyed on the request URL. The problem with this is twofold:

  1. The URL.hashCode() method is synchronized.
  2. The URL.hashCode() method triggers DNS lookups for the URL hosts.

The impact of item 2 was magnified by the fact that some of the HtmlUnit unit tests use a mock web connection to connect to fake URLs. DNS (non)resolution of these fake URLs took an especially long time.

The fix was to key the map entries on the value of URL.toString() instead. Apparently I’m not the first person to stumble across this problem. So think twice before coding your next HashMap<URL, XXX> ;-)
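A minimal sketch of the fix, for illustration only: the class below is hypothetical and is not HtmlUnit's actual cache. It shows the idea of keying entries on URL.toString(), which only formats the URL's components, so lookups never have to hash the URL object itself.

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache keyed on the URL's string form (not HtmlUnit's real class).
final class StringKeyedCache<V> {
    private final Map<String, V> entries = new HashMap<>();

    // Only String.hashCode() is ever used for lookups, so no host resolution
    // is triggered by the map.
    private static String key(URL url) {
        return url.toString();
    }

    void put(URL url, V value) {
        entries.put(key(url), value);
    }

    V get(URL url) {
        return entries.get(key(url));
    }

    int size() {
        return entries.size();
    }
}
```

One design consequence worth knowing: string keys compare literally, whereas URL.equals() can treat two different host names as equal when they resolve to the same IP address. For a request cache, the literal comparison is usually what you want anyway.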

HtmlUnit 2.1 Released

The HtmlUnit team is pleased to announce a new release of HtmlUnit. This latest version includes a number of bug fixes and performance enhancements, and sports excellent support for GWT, jQuery and Sarissa, decent support for Prototype and Dojo, and basic support for YUI. Please see the changelog for more details.

In related news, we’ve (temporarily) forked the Rhino JavaScript engine in order to add browser-compatible JavaScript behavior which is slowly making its way into the Rhino project proper. The most important of these changes (so far) is definition-order property iteration. All of this should be available in the next version; many thanks to Marc Guillemot for his work in this area.

Anyway, give it a whirl and let us know what you think!

HtmlUnit in the Wild: New Features

I’ve been using HtmlUnit to crawl the web for the past couple of weeks. This interesting experience has led to two new features:

First, I’ve added an insecure SSL handler which trusts anyone and everyone. Why? Because websites often have misconfigured or expired SSL certificates, and the standard Java behavior is to throw a bunch of exceptions when this happens. Not very nice. So now you can call WebClient.setUseInsecureSSL(true) instead and continue crawling, happily oblivious to the webmaster’s incompetence.
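For the curious, here is a rough sketch of the general "trust everyone" technique using only the standard JSSE classes. It is not HtmlUnit's implementation, and the class names TrustEveryone and TrustAllManager are made up for this example. It covers certificate chain validation only; hostname verification is a separate check that this sketch does not touch.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper; suitable for crawling or testing, never for production traffic.
final class TrustEveryone {

    // A trust manager whose checks never throw, so every certificate chain is accepted.
    static final class TrustAllManager implements X509TrustManager {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // Intentionally empty: accept any client certificate.
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // Intentionally empty: accept expired, self-signed or mismatched certificates.
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    }

    // Builds an SSLContext that uses the trust-all manager instead of the default trust store.
    static SSLContext insecureContext() throws GeneralSecurityException {
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, new TrustManager[] { new TrustAllManager() }, null);
        return context;
    }
}
```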

Second, I’ve added a popup blocker. Lots of sites send a bunch of popups your way, and even though they’re not quite as annoying when you’re using a headless browser like HtmlUnit, they still waste time and bandwidth. So now you can call WebClient.setPopupBlockerEnabled(true), and your crawler will be that much faster.

These features will be available in HtmlUnit 1.14, or you can just grab the latest snapshot build here. Enjoy!

Assertions in HtmlUnit

I’ve been going back through the pros and cons of JWebUnit as part of my research for the HtmlUnit vs Foo series of articles I’m writing. One of JWebUnit’s big draws is the set of easy-to-use assertion methods provided by its base test case class, WebTestCase. HtmlUnit doesn’t provide such a thing, because it doesn’t provide a base test case class.

There has always been something of a trade-off here: use JWebUnit and tie yourself to a specific unit testing framework (JUnit) while benefiting from a more domain-specific set of assertions (assertCookiePresent, assertFormPresent, assertLinkPresent, etc), or fly free with HtmlUnit but perform assertions using only the primitive utility methods provided by your unit testing framework (assertNull, assertNotNull, assertEquals, etc).

However, I’ve long thought that it would be nice for HtmlUnit to have the best of both worlds by using an assertion utility class, similar to TestNG’s Assert class. Experiencing the convenience of JWebUnit’s API again has given me the final kick in the pants, and the first version of HtmlUnit’s new WebAssert class is now in SVN. It will be included as part of HtmlUnit 1.14, or you can always grab the latest build here. I’m sure the set of available assertions will grow, but here is the initial list:

  • assertTitleEquals(HtmlPage, String)
  • assertTitleContains(HtmlPage, String)
  • assertTitleMatches(HtmlPage, String)
  • assertElementPresent(HtmlPage, String)
  • assertElementPresentByXPath(HtmlPage, String)
  • assertElementNotPresent(HtmlPage, String)
  • assertElementNotPresentByXPath(HtmlPage, String)
  • assertTextPresent(HtmlPage, String)
  • assertTextPresentInElement(HtmlPage, String, String)
  • assertTextNotPresent(HtmlPage, String)
  • assertTextNotPresentInElement(HtmlPage, String, String)
  • assertLinkPresent(HtmlPage, String)
  • assertLinkNotPresent(HtmlPage, String)
  • assertLinkPresentWithText(HtmlPage, String)
  • assertLinkNotPresentWithText(HtmlPage, String)
  • assertFormPresent(HtmlPage, String)
  • assertFormNotPresent(HtmlPage, String)
  • assertInputPresent(HtmlPage, String)
  • assertInputNotPresent(HtmlPage, String)
  • assertInputContainsValue(HtmlPage, String, String)
  • assertInputDoesNotContainValue(HtmlPage, String, String)
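
To illustrate the utility-class approach without tying it to a test framework, here is a small stand-in sketch. It is not the WebAssert source: the class name TitleAssert is hypothetical, and it takes the page title as a plain String where the real methods take an HtmlPage. Failures are reported with java.lang.AssertionError, which any Java test framework can surface.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for an assertion utility class; no base test case required.
final class TitleAssert {
    private TitleAssert() {
        // Static utility methods only.
    }

    static void assertTitleEquals(String actualTitle, String expected) {
        if (!expected.equals(actualTitle)) {
            throw new AssertionError("Expected title '" + expected + "' but was '" + actualTitle + "'");
        }
    }

    static void assertTitleContains(String actualTitle, String fragment) {
        if (actualTitle == null || !actualTitle.contains(fragment)) {
            throw new AssertionError("Title '" + actualTitle + "' does not contain '" + fragment + "'");
        }
    }

    static void assertTitleMatches(String actualTitle, String regex) {
        if (actualTitle == null || !Pattern.matches(regex, actualTitle)) {
            throw new AssertionError("Title '" + actualTitle + "' does not match '" + regex + "'");
        }
    }
}
```

Because the methods are static and throw a standard error type, the same calls work from JUnit, TestNG, or a plain main method.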

HtmlUnit vs HttpUnit

There’s a lot of misinformation out there regarding web application test tools, so I’ve decided to post a series of short articles comparing some of the open source options available here in Java-land, circa 2007. The first of these articles will focus on HtmlUnit and HttpUnit. Please take my criticism and praise with a grain of salt, as I’m a committer to the HtmlUnit project and thus probably biased. Nevertheless, I will do my best to be objective. I may even overcompensate in the other direction!

Confusion

The HtmlUnit and HttpUnit projects are often confused due to the similarity of their names. And the similarity doesn’t end there: they are both open source projects; they are both 100% Java frameworks, rather than drivers for native browsers like IE or Firefox; and they are both fairly mature projects.

This confusion is compounded by the fact that many test frameworks which once used HttpUnit under the covers have since switched to using HtmlUnit, mainly in order to benefit from its excellent JavaScript support. Examples include JWebUnit, whose FAQ briefly explains the switch, and Canoo WebTest, which switched in 2004 due to JavaScript support issues and an unresponsive development team [1].

HttpUnit

HttpUnit is the granddaddy web app testing framework. Started in the summer of 2000 by Russ Gold [2], it was the first project to focus on this niche area. The project has since stagnated somewhat, with nearly 40% of bugs remaining open, some of them nearly three years old. Its latest maintenance release is about a year and a half old.

The API is fairly low-level, modeling web interactions at something approaching the HTTP request and response level. The following is a slightly modified example from the HttpUnit Cookbook:

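import com.meterware.httpunit.*;

// Fetch the page, then follow a link identified by its text.
WebConversation wc = new WebConversation();
WebResponse resp = wc.getResponse("http://www.google.com/");
WebLink link = resp.getLinkWith("About Google");
link.click();
// The conversation tracks the page reached by the click.
WebResponse resp2 = wc.getCurrentPage();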

As you can see, things center around WebConversations, WebRequests and WebResponses. Unfortunately, any page with a decent amount of JavaScript is likely to break HttpUnit, and you can absolutely forget testing any pages which use third party JavaScript libraries.

Nevertheless, HttpUnit continues to generate 3,000 to 4,000 downloads per month. A good analogy, if I may be allowed a brief subjective comment, is that HttpUnit is to the web app testing world what Struts is to the web app framework world: there are many “better” options out there, but it just won’t go away! ;-)

HtmlUnit

HtmlUnit is itself a fairly old project, having been started by Mike Bowler in early 2002. Mike has since ceased active development, but the project currently boasts 3 or 4 active developers and a total of seven committers (whereas HttpUnit remains a one-man show). It averages about three releases per year, and has seen increased developer activity in the past six months or so, especially in the area of JavaScript support.

HtmlUnit’s API is a bit more high-level than HttpUnit’s, modeling web interaction in terms of the documents and interface elements which the user interacts with:

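import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.*;

// Load the page, find the search form, and click its submit button.
WebClient wc = new WebClient();
HtmlPage page = (HtmlPage) wc.getPage("http://www.google.com");
HtmlForm form = page.getFormByName("f");
HtmlSubmitInput button = (HtmlSubmitInput) form.getInputByName("btnG");
// click() returns the page that results from the submission.
HtmlPage page2 = (HtmlPage) button.click();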

As you can see, the code centers around WebClients, as well as pages, links, forms, buttons, etc. Pages with a modicum of custom JavaScript will probably work when tested with HtmlUnit. Unfortunately, pages which use third party libraries might or might not work when tested via HtmlUnit. As of the current version, Prototype, Script.aculo.us, DWR and jQuery are known to be supported fairly well, Dojo is a bit of an unknown, YUI is known to be unsupported, and GWT is known to work with fairly simple applications. Most of this compatibility has been achieved in the past two or three releases, so obviously things are fairly fluid.

Conclusion

If you’re using HttpUnit for legacy reasons, it’s a fairly solid package, but don’t expect to get much support when you need to report a bug or submit a patch for a new feature. If you’re starting a new project and are trying to decide between these two frameworks, HtmlUnit wins hands down. It has the features, the community and the momentum.

Of course, if you’re considering web application testing tools, you’re probably looking at more than just these two options. Canoo WebTest, TestMaker, JWebUnit, Selenium, WebDriver and JMeter are all likely to be on your list. Depending on your project budget and requirements, Squish and Mercury QTP may also be under consideration. If that’s the case, stay tuned, because I intend to post a series of web app testing framework comparisons in the coming months — all of them involving HtmlUnit, of course!

[1] It’s interesting to note that both Marc Guillemot and I (two HtmlUnit committers) began by using HttpUnit, submitting patches for missing features — but settled on HtmlUnit when the patches were not applied in a timely manner.

[2] The HttpUnit website states that Russ currently works for Oracle, developing the OC4J application server. Coincidentally, this is the production application server we’re using at my day job. Thanks, Russ! :-)

HtmlUnit 1.12 Released

As per Marc’s announcement on the mailing lists and my post to TSS, HtmlUnit 1.12 has been released.

It contains a really mind-blowing number of bugfixes, a couple of very important performance improvements (including one last minute change which cut the build time by a third), and a couple of new features like Marc’s experimental AJAX controller. The change log has all the details.

Progress has been made in the compatibility department on a number of fronts: Marc Guillemot has been working on script.aculo.us drag’n’drop support, more Prototype unit tests are passing, I’ve gotten all the jQuery unit tests to pass, and Ahmed Ashour has committed support for basic GWT applications.

Robert Di Marco and I are both interested in investigating YUI compatibility, so there may be some news to look forward to in that area.

Enjoy!

HtmlUnit: Taming jQuery

Nearly three weeks ago my wife went on a trip with her sister’s family to visit her grandfather, who was celebrating his ninetieth birthday in Indiana. She was gone for four days, including an entire weekend. Now some guys might take an opportunity like that, call up whatever single friends they have left, and have some fun out on the town. I, on the other hand, realized that it had been a really, really long time since I’d last been able to indulge in an all-night hackathon ;-)

I’ve got a couple of side projects going, but I finally settled on trying to get HtmlUnit to run through the jQuery unit tests. As web applications have gotten more and more complex, they’ve begun to rely on really complex libraries like jQuery, Dojo, Prototype and YUI to abstract away lower-level details like browser incompatibility and tedious APIs like getElementsByTagNameButOnlyOnMonday()… or something like that. I forget. The point is that HtmlUnit has not kept up with this trend (though we’re fairly compatible with Prototype).

Knowing that HtmlUnit has fairly advanced JavaScript support, I figured that full jQuery support couldn’t be more than two days work away. Well, three weeks later, I’m proud to announce that HtmlUnit 1.12 will fully support jQuery, and we can prove it because we run jQuery’s own unit tests as part of HtmlUnit’s unit tests. That’s a lot of units!

So now for the details. The following HtmlUnit bugs had to be fixed in order to reach this point:

  • Bug in string.replace() implementation when using string parameters.
  • Bug in string.replace() implementation when using function parameters.
  • Bug in the scope of eval() invocations in event handlers.
  • Bug in document.getElementsByTagName('*'), which was always returning an empty collection.
  • Bug in default option selection in select elements without an explicit selection.
  • Add missing IE-only element.style.filter attribute.
  • Bug in element.innerHTML and element.outerHTML: uppercase tag names when emulating IE.
  • Bug in element.innerHTML and element.outerHTML: don’t quote attributes that don’t contain whitespace when emulating IE.
  • Bug in element.innerHTML and element.outerHTML: always use separate open and close tags (even if the tag is empty).
  • Bug in element.innerHTML, element.innerText and element.outerHTML: escape XML characters inside text nodes.
  • Allow background tasks initiated with setTimeout() to finish if clearTimeout() is called after they have started execution.
  • The method element.style.getPropertyValue('foo') was expecting camel-cased property names, rather than delimiter-separated names.
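
The last item in the list above comes down to two spellings of the same CSS property: the delimiter-separated form used in stylesheets (background-color) and the camel-cased form used for script properties (backgroundColor). The sketch below shows one way to map the first onto the second. It is an illustration only, not the actual HtmlUnit fix, and the class name CssNames is hypothetical.

```java
// Hypothetical helper: converts hyphen-delimited CSS names to camel case.
final class CssNames {
    private CssNames() {
        // Static utility methods only.
    }

    static String toCamelCase(String delimited) {
        StringBuilder out = new StringBuilder(delimited.length());
        boolean upperNext = false;
        for (int i = 0; i < delimited.length(); i++) {
            char c = delimited.charAt(i);
            if (c == '-') {
                // Drop the hyphen and capitalize the character that follows it.
                upperNext = true;
            } else if (upperNext) {
                out.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With a helper like this, a lookup can accept the delimiter-separated name and normalize it before consulting properties stored under camel-cased keys.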

The first three bugs never even made it into a production release, because they were introduced in changes made since the last release (1.11). Only one of the fixes involved adding support for an entirely new feature — the rest were all refinements of existing implementations.

I think this is fairly indicative of how far along HtmlUnit’s DOM and JavaScript support has come, and I expect similar efforts with other JavaScript libraries to become less and less onerous as these refinements get to be smaller and fewer.

Indeed, support for the Prototype library is pretty far along (we also run its unit tests as part of HtmlUnit’s unit tests) and is blocked in many places by small incompatibilities between the Rhino JavaScript engine and the browser JavaScript engines — not by HtmlUnit bugs.

So enjoy the upcoming HtmlUnit release, and let us know if you run into any problems. And if you have the time and inclination, send a couple of patches our way ;-)

Research Paper: HtmlUnit Refactoring

The other day I stumbled across a research paper entitled Digging the Development Dust for Refactorings [1], which addresses software repository data mining. Specifically, the paper identifies four types of data which can be examined — source code metrics, identifiers, ROI estimates, and design differencing — and examines their use in building a refactoring history for a software project. Which project, you ask? HtmlUnit!

From the abstract:

Software repositories are rich sources of information about the software development process. Mining the information stored in them has been shown to provide interesting insights into the history of the software development and evolution. Several different types of information have been extracted and analyzed from different points of view. However, these types of information have not been sufficiently cross-examined to understand how they might complement each other. In this paper, we present a systematic analysis of four aspects of the software repository of an open source project — source-code metrics, identifiers, return-on-investment estimates, and design differencing — to collect evidence about refactorings that may have happened during the project development. In the context of this case study, we comparatively examine how informative each piece of information is towards understanding the refactoring history of the project and how costly it is to obtain.

The authors evaluate their proposed refactoring detection methodology by trying it out on the HtmlUnit repository:

To evaluate the effectiveness of our lightweight refactoring method, we examined an open-source system HTMLUnit. HTMLUnit is a realistic representative example of open-source development. There are nine releases in its history from May 22, 2002 to March 17, 2005. It is quite well documented; in fact, examining the log comments in its CVS-repository history, we found many references to refactorings and their rationale, which is critical for our understanding of the system lifecycle.

So we get kudos on our commit logs. Continuing on into the conclusion:

Based on our HTMLUnit case study, we have found that a heuristic combination of source-code metrics and identifiers-movement analysis — using information easily available on any repository platform — can be quite effective in recovering specific refactorings in the software evolutionary lifecycle, albeit not as accurate as structural analysis of the logical system design and less computationally intensive. An even more interesting finding was that the refactorings omitted by the developers in the system’s documentation were found to be “bad investments of development time” according to our ROI estimate, which implies that developers’ documentation is a good description of the developers’ intention if not of their actual work.

Apparently the authors’ analysis identified 11 refactorings, three of which were not documented in the commit logs. These same three undocumented refactorings were also found to have negative ROIs and less than 50% relevance. The assumption made by the authors is that these three refactorings were accidental: code cleanups or bug fixes that got a little too bloated. So basically we’re pretty good about documenting our refactorings, except when we “accidentally refactor”. Interesting stuff!

[1] C. Schofield, B. Tansey, Z. Xing and E. Stroulia, Digging the Development Dust for Refactorings, Proc. of the 14th International Conference on Program Comprehension, Athens, Greece, June 14-16, 2006.

HtmlUnit and the Principle of Least Surprise

Robert Martin has some constructive criticism for HtmlUnit:

I’m using HtmlUnit to parse and interpret HTML web pages. I’ve been very impressed with this library so far. And I appreciate the hard work and dedication of people who give their software away for free. So, although this blog is a complaint, it should not be misconstrued into anything more than constructive criticism.

What I want to do with HtmlUnit is quite simple. Given a string containing HTML, I’d like to query that HTML for certain tags and attributes… Sweet, simple, uncomplicated. Just create the DOM from an HTML String, and then query that DOM. Unfortunately, HtmlUnit does not appear to be that simple.

Given my simple needs, why do I care about WebClient and Window. Why do I have to turn off the JavaScript engine? It may seem a small thing, but it bothers me nonetheless. It’s the principle of the matter that gets under my skin. The pragmatic programmers called it The Principle of Least Surprise. I call it, simply, dependency management. Don’t make people depend on more than they need.

It does indeed look like Robert had to do some digging in order to bend HtmlUnit to his will. However, I have to say that I’ve never seen anyone else use the library like he has, as a glorified HTML parser. HtmlUnit’s raison d’être is to test web applications — it is essentially a headless browser. Some people use it for screen scraping, because it can handle some pretty ugly HTML and JavaScript. But the further away from its intended use you get, the more trouble you are going to run into.

Take, for example, the complaint that he had to disable JavaScript processing. If we disabled JavaScript processing by default, 95% of our users would take to the streets howling for our blood — and I think I speak for all of us when I say that none of the HtmlUnit devs have a death wish ;-)

Robert’s central complaint is that it’s hard to feed HtmlUnit a String and get back an HtmlPage. The problem is that WebClient, the central HtmlUnit class, does in fact have a getPage(String) method, but it’s geared to the vast majority of users who want to be able to call getPage("http://my.testable.site"), rather than getPage("<html><head><title>My Testable Site</title>…</html>").

Robert claims that HtmlUnit violates the principle of least surprise, i.e. “do the least surprising thing.” Surprise is a subjective beast, but I would submit that catering to the 95% of users who use HtmlUnit as it was intended is definitely the least surprising route to take. That’s not to say that we can’t accommodate alternate uses (perhaps by adding a getPageForHtml(String) to WebClient), but not at the expense of our core competency.

Open Source Project Statistics

As an HtmlUnit committer, every month or so I spend about half an hour searching the web for new mentions of the project, just to get an idea of the latest buzz, unreported problems, unsung praises, etc. During the latest of these searches, I ended up at ohloh, an online directory for open source software projects.

The site lists a large number of projects, and includes interesting metrics for each of them, including codebase size, estimated effort and cost necessary to reproduce the codebase, level of documentation, related projects, user ratings and reviews, etc. Here are some of the metrics, as of today, for HtmlUnit and some related projects:

Project          Lines of Code   Man-Years of Effort   Total Cost ($)
HtmlUnit                53,137                    13          698,513
HttpUnit                30,967                     7          399,128
jWebUnit                 9,385                     2          111,795
Canoo WebTest           77,505                    19        1,024,914

Both jWebUnit and Canoo WebTest are based on HtmlUnit, using it internally to do the HTML / JavaScript heavy lifting. However, jWebUnit appears to be a thin wrapper, while Canoo WebTest is larger than HtmlUnit itself! HttpUnit is somewhat smaller than HtmlUnit, which makes sense: HtmlUnit provides a higher level of abstraction and supports many more JavaScript constructs than HttpUnit does.

It’ll be interesting to see if this site takes off or not. Developers often have to pull information from a wide array of sources in order to make informed decisions as to the libraries that they need to include in their stack. While ohloh provides some of this information, there is one glaring omission: community size and growth trends. This is usually measured via mailing list activity — lots of posts to the mailing lists imply a large user community. It would be nice to have this information listed as well, and not just as a factoid in the summary section.

Regardless of this small omission, ohloh is a fun site to browse. It’s interesting to see their take on the various software projects out there, and it’s entertaining to compare competing libraries (SVN vs CVS!). If lots of people actually start using it to recommend and critique software, it might become an extremely useful website.
