URL.hashCode() Considered Harmful

I just cut HtmlUnit’s build time by about 20% by changing four lines of code. How? HtmlUnit keeps a small cache of web requests in a HashMap, keyed on the request URL. The problem with this is twofold:

  1. The URL.hashCode() method is synchronized.
  2. The URL.hashCode() method triggers DNS lookups for the URL hosts.

The impact of item 2 was magnified by the fact that some of the HtmlUnit unit tests use a mock web connection to connect to fake URLs. DNS (non)resolution of these fake URLs took an especially long time.

The fix was to key the map entries on the value of URL.toString() instead. Apparently I’m not the first person to stumble across this problem. So think twice before coding your next HashMap<URL, XXX> ;-)
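For illustration, here is a minimal sketch of that kind of change. It is not HtmlUnit's actual code: the ResponseCache class, its method names, and the use of String as the cached value are all made up for the example. The only point it shows is that the map is keyed on URL.toString(), so HashMap never calls URL.hashCode() or URL.equals().

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache, for illustration only (not the HtmlUnit implementation).
final class ResponseCache {

    // Keyed on the URL's string form, so hashing and equality checks are
    // plain String operations and never touch the network.
    private final Map<String, String> cache = new HashMap<>();

    // Stores a response body under the textual form of the URL.
    void put(URL url, String responseBody) {
        cache.put(url.toString(), responseBody);
    }

    // Returns the cached body, or null if this exact URL text was never stored.
    String get(URL url) {
        return cache.get(url.toString());
    }
}
```

One trade-off to keep in mind: with String keys, two URLs that are spelled differently are separate entries, even if their hosts would resolve to the same address and URL.equals() would therefore treat them as equal. For a request cache that is probably what you want anyway.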

5 Comments

  1. Robert O'Connor said,

    April 24, 2008 at 3:23 am

    Josh Bloch talked about this in one of the Java Puzzlers videos. I don’t remember exactly which one; search YouTube or Google Video (it’s one of the Google Tech Talks).

  2. Marc Guillemot said,

    April 24, 2008 at 10:17 am

    Cool!

    The most efficient way would probably be to provide our own URLStreamHandler, allowing a custom implementation of hashCode() whose result is cached by URL. Indeed, the result of URL.toString() is not cached (and it uses a StringBuffer rather than a StringBuilder).

  3. Geoffrey Wiseman said,

    April 24, 2008 at 2:37 pm

    Yes, many people suggest sticking to URI instead of URL for that reason among others.

  4. Daniel Gredler said,

    April 24, 2008 at 3:26 pm

    Robert: Interesting, I’ll have to look for that video.

    Marc: We can write our own getKey(URL) which does the same thing as toString() with a StringBuilder instead of a StringBuffer. As far as I can tell, however, customizing the handler requires either the use of one of the more involved URL constructors, or a JVM-global URLStreamHandlerFactory change, which I’m not sure a third party library should be doing. I’m also unsure how much benefit the hashCode caching provides (once you avoid doing DNS lookups!).

  5. Marc Guillemot said,

    April 25, 2008 at 2:27 am

    I don’t think that we really need to improve this: the gain would be minimal and we surely have other areas that would be far more interesting to optimize.
