Sunday, July 17, 2005

Don't use java.rmi.server.UID for generating Unique IDs

There is a well known bug in java.rmi.server.UID in all the JDKs prior to Java 5. The problem occurs when two or more JVMs are running on the same host and are started exactly in the same millisecond. We found the bug the hard way while troubleshooting a real production problem that was rather painful.

UID as generated by java.rmi.server.UID is comprised of a host unique JVMID, followed by current time in milliseconds and followed by a sequence number as shown below:

JVMIDTime in milliseconds since Jan 1, 1970Seq No

An example UID is:
29b7fbc2 - 0000-0104-0bb5-afc9 - b2e0
where
29b7fbc2 - JVM Identifier unique to the machine.

0000-0104-0bb5-afc9 - Milliseconds since the epoch of 1970. In this example the value is May 23, 2005 18:39:13 EDT. This will change only when the sequence number rolls over. The JVM does not get the current time for every UID request in order to improve performance. Therefore, the msec time here is the time when the sequence number rolls over NOT the time when the UID was created.

b2e0 - Sequence number which counts from 8000 to 0 to 7FFF (in HEX).

UID is widely used as a starting point for generating unique Ids. We too used the UID: our UID scheme simply appended 3rd and 4th byte of host IP address to the number generated by java.rmi.server.UID. This UID was used as session identifier for web transactions with one of our legacy backend system. This scheme worked fine over a year and half without problem. Suddenly, about 2 months back, we started seeing duplicate sessions ids. Sometimes (we couldn’t nail how) one user would steal the transaction from another user and our backend system would actually commit the transaction incorrectly for the wrong user. We were puzzled and baffled. We looked at system clock for the UNIX server, thinking that the system clock was not ticking and time was not advancing (ruled out later, it started becoming random with respect to UNIX server). May be the time synchronization process for the cluster was setting the system clock back for these servers. After much analysis that was determined not be the case. We looked for similar problems but for the most part we were at a loss.

To give a brief idea of our system: it is a J2EE based very high volume web site. This system is consumer facing, 24 by 7, global e-commerce site and is a major revenue generating channel for the company. It interfaces with Mainframe based legacy systems for core business transactions. Another important aspect, to meet performance requirements, we had to stand up 2 JVMs per host about 8 months back.

Needless to say the heat was on, this problem although very miniscule (only 7 or 8 cases was reported among upwards of five hundred thousand transactions) nonetheless, it had everybody worried and more so since we were clueless. Company’s name and trust was at stake if this continued for long. The customers who’s transactions were messed up, called in our customer care line and got the problem resolved.

We got a lucky break a few weeks back when we were troubleshooting one such occurrence. It came to our attention that this problem started after a cluster switch was performed and ripple start of 2nd set of JVMs was executed. Then somebody mentioned that last time also the problem started after a server restart. That got us thinking – something to do with JVM restarts – at least some semblance of repeatable pattern. We had always fixed the problem by shutting down the offending application servers and re-starting them. What if there is a bug in UID generation? We knew for sure that the last 2 bytes of the UID will be same for all UIDs on a given host. How is JVM id determined? Does it use start up time (we had actually ruled this out earlier assuming Java will have a smarter way of guaranteeing JVM uniqueness on a host and must not be based on JVM startup time. Bad assumption)? We googled “bug in java.rmi.server.UID” and there it was !! (we love google, the best customer support and knowledge base in the world). We quickly copied the programs posted with the bug in the Sun’s Java bug database to reproduce the problem and sure enough we did it very reliably. So under some circumstances our ripple start scripts were starting the two application servers (or JVMs) on a given host in the cluster exactly at the same millisecond and when it happened (which was very random to us since we have 11 servers with 26 JVMs in our primary cluster) the two JVMs would start generating duplicate UIDs.

Moral of the story, don’t rule out bugs in core Java libraries and don’t assume that Sun is always very smart and will always have elegant solutions. And please use google or search Sun's Java bug database for finding Java bugs first. It might just save many a heart burns and sleepless nights. We were dumbfounded to learn that this bug has been in Java since its inception and the bug was first filed in 1999. Second all of us were looking rather chastised and humbled, how come we didn’t know? Most of us have been working with Java since 1998.

Note: Please see the bug description for details of the bug and the circumstances under which it can occur. Our fix was that we wrote our own UID generator. Java 5 fixes the problem by using new SecureRandom().nextInt() (see source code in src.zip in $JAVA_HOME distributed with JDK 5, src/java/rmi/server/UID.java) for JVM unique id instead of new Object().hashCode() (see source code in src.zip in $JAVA_HOME distributed with JDK 1.4, 1.3, 1.2 etc., src/java/rmi/server/UID.java) in pre Java 5 versions.

Comments: Post a Comment



<< Home

This page is powered by Blogger. Isn't yours?