Dynamic RMI-IIOP Optimization


Last Modified: May 20, 2004

Version: 1.1

<Ken Cavanaugh>

Now that we have Dynamic RMI-IIOP (DRI) functioning very well, it is time to consider optimizations. I use the simpleperf test to measure the cost of optimized co-located calls, which is the only case where the impact of DRI is significant. We can measure this with static stubs, with dynamic (JDK proxy) stubs using either the optimized copyObject or stream-based copy, and with BCEL stubs using either the optimized copyObject or no copy at all. The results are very informative.

Test Type                     Static Stubs  Dynamic Stubs   Dynamic Stubs  BCEL Stubs      BCEL Stubs
                                            Optimized Copy  Stream Copy    Optimized Copy  No Copy
---------------------------------------------------------------------------------------------------
colocated normal POA          60.3          139.8           865.7          129.1           70.9
full caching                  28.2/4.4      22.3/21.2       315.2/285.2    61/16.5         61/6.8
info only caching             1.9/1.8       20.9/11.6       278/264.8      18.4/11.1       6.6/5.0
minimal caching               .9/.9         55.8/34.4?      318.3/374.7?   45.8/33.5?      10.8/3.8
same machine, different ORB   1540.6        1573.8          1527           1559            1440

All times are from my Linux machine libros, which is configured as follows:

I have also run the same test on the same machine with JDK 1.5.0 beta 2 (the numbers above are from 1.4.2_04):

Test Type                     Static Stubs  Dynamic Stubs   Dynamic Stubs  BCEL Stubs      BCEL Stubs
                                            Optimized Copy  Stream Copy    Optimized Copy  No Copy
---------------------------------------------------------------------------------------------------
colocated normal POA          16.3          -               -              34.5            22.8
full caching                  2.7/1.6       -               -              12.5/10.5       3.6/2.5
info only caching             1.2/.8        -               -              8.9/8.7         2/1.4
minimal caching               .6/.4         -               -              26.3/26         1.3/1
same machine, different ORB   597           -               -              546             563

(- indicates a case that was not measured.)

Generally, 1.5.0 is fully 3 times faster than 1.4.2_04. The ratio is usually preserved, but note that the full servant caching BCEL number does not improve as much as the corresponding static stub number.

All times are given in microseconds. When two times are given, the first is a straight creation, and the second is a creation followed by an object->string->object conversion, which seems to cause the test to run faster (I have no idea why). In all cases, the invocation tests were run 10000 times, and these numbers reflect the average time per invocation. The test implementation does nothing: it just receives a single long argument and returns what it received.
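For concreteness, the test amounts to something like the following (a hypothetical sketch; the interface and class names are illustrative, not the actual simpleperf source):

    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // Hypothetical sketch of the echo test.  The implementation does no
    // work, so the measured time is almost entirely invocation-path
    // overhead: stub dispatch, argument wrapping, and argument/result copy.
    interface Echo extends Remote {
        long echo(long value) throws RemoteException;
    }

    class EchoImpl implements Echo {
        public long echo(long value) {
            return value;  // return exactly what was received
        }
    }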

A number of things can be learned from these numbers (which have been quite consistent every time I have run them). First, the resolve (the object->string->object conversion) sometimes makes a large difference, particularly in the full caching case. This is a puzzle. Another surprise is that the dynamic stubs/optimized copy/minimal caching case is so slow. There does appear to be much more variability in the dynamic stub numbers, which may indicate more GC activity, but I have not yet explored this.

The largest component of the time is now obvious from the 1.5 numbers: calling copyObjects costs a lot of time, even with the highly optimized copyObject implementation. In fact, without the copyObject overhead, the dynamic case takes barely twice as much time as the static.

Taking just the full caching and info only caching cases, we see that the difference between dynamic and static ranges from about 6-19 microseconds per call. This is the cost of using reflection and packaging all of the arguments in wrapper objects and arrays, plus the cost of the copyObject(s) calls. copyObject is particularly important, since the overhead of not using the optimized copyObject code is about 250-300 microseconds per call. This seems to be the cost of creating 2 ORB streams (one for copying the arguments and one for copying the result). The same machine/different ORB tests show no consistent differences across stub types, leading me to believe that the stream creation overhead is similar in both cases: both create an argument stream and a result stream, and the processing is nearly the same. It appears that roughly 20% of the cost of a remote invocation (with a very fast network) is due to stream creation. We could also run the local/different ORB tests with Java serialization and see what the difference is, but I have not done this so far.
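For context, the unoptimized (stream copy) path marshals the value through a pair of ORB streams, roughly like this (a simplified sketch of the idea, not the actual Util code; it assumes the ORB returns CORBA 2.3 streams, as the Sun ORB does):

    import org.omg.CORBA.ORB;

    // Simplified sketch of stream-based copy: write the value to a fresh
    // ORB output stream, then read it back from the matching input stream.
    // Creating these streams (once for the arguments, once for the result)
    // is what accounts for the 250-300 microsecond per-call overhead.
    static Object streamCopy(java.io.Serializable obj, ORB orb) {
        org.omg.CORBA_2_3.portable.OutputStream out =
            (org.omg.CORBA_2_3.portable.OutputStream) orb.create_output_stream();
        out.write_value(obj);
        org.omg.CORBA_2_3.portable.InputStream in =
            (org.omg.CORBA_2_3.portable.InputStream) out.create_input_stream();
        return in.read_value();
    }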

There are significant opportunities for optimization in the current implementation. I will focus on the BCEL proxy implementation, since that is the one we will ship; it also appears to offer significantly more opportunity for optimization than the proxy-based implementation. We will still preserve the JDK proxy implementation for the present. It is becoming apparent that my original idea of using the same InvocationHandler for both JDK and BCEL proxies may not work, and it is definitely not optimal.

One big optimization is possible in the InvocationHandlerFactory. This currently constructs an InvocationHandler that dispatches requests to the remote interfaces, as well as to DynamicStub, org.omg.CORBA.Object, and java.lang.Object. This is necessary because JDK proxies all extend java.lang.reflect.Proxy, so the base class cannot do anything useful for the application. However, the BCEL proxies extend BCELStubBase, which extends javax.rmi.CORBA.Stub, and Stub implements everything needed except for the remote interfaces themselves. This means that we never need to do an InvocationHandler lookup in the BCEL proxy case. In fact, the BCEL proxy can go directly to the StubInvocationHandler, rather than through the more complex composite invocation handler returned by the InvocationHandlerFactoryImpl class, as sketched below.
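Conceptually, each generated remote-interface method then reduces to a direct call on the stub's handler. The following is an illustrative Java rendering of what the generated bytecode amounts to; the real generated class extends BCELStubBase, and here a plain class stands in for it, with a handler field referencing the StubInvocationHandler directly:

    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.Method;

    // Illustrative sketch of a generated stub for the Echo interface above.
    // Because BCELStubBase inherits everything else from
    // javax.rmi.CORBA.Stub, only the remote interface methods need
    // generated bodies, and each body dispatches directly to the handler
    // instead of going through the composite handler.
    class EchoStubSketch implements Echo {
        private final InvocationHandler handler;

        EchoStubSketch(InvocationHandler handler) {
            this.handler = handler;
        }

        public long echo(long value) throws java.rmi.RemoteException {
            try {
                // A real generated stub would cache this Method in a field
                // rather than look it up on every call.
                Method m = Echo.class.getMethod("echo", new Class[] { long.class });
                Object result = handler.invoke(this, m, new Object[] { new Long(value) });
                return ((Long) result).longValue();
            } catch (java.rmi.RemoteException rex) {
                throw rex;  // declared exceptions pass through
            } catch (Throwable t) {
                throw new java.rmi.RemoteException(t.getMessage(), t);
            }
        }
    }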

Another area to look at is the doPrivileged block that calls setAccessible in the StubInvocationHandlerImpl code. This is happening on every call. However, there is a simple solution: just check isAccessible() before doing the setAccessible call. isAccessible simply returns a boolean in the JDK, so its cost is trivial. This avoids the doPrivileged block on every call.
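A sketch of that check (the helper method itself is hypothetical; the reflection and security calls are standard JDK API):

    import java.lang.reflect.Method;
    import java.security.AccessController;
    import java.security.PrivilegedAction;

    // Only pay for the doPrivileged block when the method is not already
    // accessible.  isAccessible() is a cheap flag read, so the common case
    // (already made accessible on an earlier call) costs almost nothing.
    static void ensureAccessible(final Method method) {
        if (!method.isAccessible()) {
            AccessController.doPrivileged(new PrivilegedAction() {
                public Object run() {
                    method.setAccessible(true);
                    return null;
                }
            });
        }
    }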

I also need to look at the implementation of copyObject(s). Optimizing this further will be difficult, as the code is already fairly close to optimal for what it does (without pushing further into dynamic code generation for class copiers). One possibility: we can determine when a copyObjects call is made with nothing but wrapped primitives and avoid the array copy in that case. We cannot skip the copy calls for wrapped primitives entirely, but the code handles those as immutables anyway.
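A sketch of that check (a hypothetical helper; the real copier's fallback path is elided):

    // Hypothetical fast path for copyObjects: if every element of the
    // argument array is a wrapped primitive (or null), the contents are
    // immutable, so the per-element copy can be skipped and the array
    // used as-is, e.g.: if (onlyWrappedPrimitives(args)) return args;
    static boolean onlyWrappedPrimitives(Object[] args) {
        for (int i = 0; i < args.length; i++) {
            Object arg = args[i];
            if (arg == null)
                continue;
            if (!(arg instanceof Boolean || arg instanceof Byte
                    || arg instanceof Character || arg instanceof Short
                    || arg instanceof Integer || arg instanceof Long
                    || arg instanceof Float || arg instanceof Double))
                return false;
        }
        return true;
    }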

A more difficult optimization would be to change the BCEL-generated proxy code so that a different dispatch path is used for local calls, one that completely avoids wrapping the arguments and allocating an argument array. I'll probably explore this again, but it will not be possible before the Milestone 4 deadline.
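If it does become feasible, the generated code might look something like this (purely speculative; isLocal, getServant, and echoRemote are illustrative placeholders, not actual ORB API):

    // Speculative sketch of a generated stub method with a separate local
    // dispatch path.
    public long echo(long value) throws java.rmi.RemoteException {
        if (isLocal()) {
            // Colocated case: call the servant directly with the primitive
            // argument -- no Object[] allocation, no Long wrapping.  (A real
            // implementation would still need copyObject semantics for
            // non-primitive arguments.)
            return ((Echo) getServant()).echo(value);
        }
        return echoRemote(value);  // normal marshaling path
    }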