6.1. Unary Operations

Here we examine times for basic unary operations on float and int. The integers in these measurements are arbitrary precision integers (BigInteger or a Python type based on it) for comparability.

6.1.1. VSJ 2 evo 4

The measurements are of the abstract API that supports the Python byte code interpreter. If there were a compiler for VSJ 2, it would probably emit calls to these methods.

In this test we are measuring the execution time of Number.negative, and comparing it with the an equivalent calculation in pure Java, where the type of the operand is known statically. This will tell us how much we are paying for the dynamic typing in Python, in a particular implementation.

Benchmark                Mode  Cnt   Score   Error  Units
PyFloatUnary.neg         avgt   20  23.966 ± 0.306  ns/op
PyFloatUnary.neg_java    avgt  200   5.616 ± 0.041  ns/op
PyFloatUnary.nothing     avgt  200   5.650 ± 0.053  ns/op
PyLongUnary.neg          avgt   20  34.552 ± 0.264  ns/op
PyLongUnary.neg_java     avgt  100  16.676 ± 0.098  ns/op
PyLongUnary.negbig       avgt   20  35.507 ± 0.636  ns/op
PyLongUnary.negbig_java  avgt   20  16.537 ± 0.088  ns/op

The invocation overhead is consistently around 18ns on this machine, over the basic cost of the pure Java operation. The test fixture code looks (typically) like this:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)

@Fork(2)
@Warmup(iterations = 20, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)

@State(Scope.Thread)
public class PyFloatUnary {

    double v = 42.0;
    PyObject vo = Py.val(v);

    @Benchmark
    @Fork(4)  // Needs a lot of iterations to resolve short times
    @Measurement(iterations = 50)
    public double nothing() { return v; }

    @Benchmark
    @Fork(4)  // Needs a lot of iterations to resolve short times
    @Measurement(iterations = 50)
    public double neg_java() { return -v; }

    @Benchmark
    public PyObject neg() throws Throwable {
        return Number.negative(vo);
    }
    // ...

We have a test method on the API call, and a nearest equivalent Java method (suffix _java). The name is chosen to make them adjacent in the results table.

A result must always be returned that depends on all the code under test, so that JMH can convince Java that the result will be used elsewhere, and that it cannot be optimised away to nothing.

We also measured a nothing (empty body) method as a check on the overheads of simply calling a method under test. Notice that the costs of PyFloatUnary.neg_java and of nothing are not significantly different. If there is a difference, it is less than one clock cycle of the CPU. We give an explanation for this in the section Vanishing Time in Primitive Operations.

6.1.2. Jython 2.7.2

We measure PyObject.__neg__ as this is comparable in purpose to the VSJ 2 abstract API, and is called from compiled Python. The integer arguments in this measurement are Python 2 long for comparability with the Python 3 int. They are implemented as BigInteger.

Benchmark                Mode  Cnt   Score   Error  Units
PyFloatUnary.neg         avgt   20  15.095 ± 0.090  ns/op
PyFloatUnary.neg_java    avgt  200   5.400 ± 0.039  ns/op
PyFloatUnary.nothing     avgt  200   5.653 ± 0.058  ns/op
PyLongUnary.neg          avgt   20  24.859 ± 0.214  ns/op
PyLongUnary.neg_java     avgt   20  16.601 ± 0.144  ns/op
PyLongUnary.negbig       avgt   20  24.725 ± 0.142  ns/op
PyLongUnary.negbig_java  avgt   20  16.640 ± 0.086  ns/op

We can see that the invocation overhead of the Jython 2 approach is 8-9ns on this machine.

6.1.3. VSJ 2 evo 4 with invokedynamic

We measure a specially-generated equivalent to Number.negative, that contains just an invokedynamic instruction, and will become linked to a mutable call site at run time. We do not yet have a compiler for Python that would generate that code, but this allows us to benchmark the fragment we expect one to emit.

The call site becomes specialised to invoke Slot.op_neg from the type (or types) encountered, and therefore we call the same VSJ 2 implementation of __neg__ that was engaged in the plain VSJ 2 benchmark. Only the linkage and call mechanisms are different. In particular, Python int is still implemented using BigInteger.

Benchmark                Mode  Cnt   Score   Error  Units
PyFloatUnary.neg         avgt   20  14.273 ± 0.161  ns/op
PyFloatUnary.neg_java    avgt  200   5.492 ± 0.031  ns/op
PyFloatUnary.nothing     avgt  200   5.562 ± 0.064  ns/op
PyLongUnary.neg          avgt   20  25.405 ± 0.555  ns/op
PyLongUnary.neg_java     avgt  100  16.691 ± 0.108  ns/op
PyLongUnary.negbig       avgt   20  24.763 ± 0.489  ns/op
PyLongUnary.negbig_java  avgt   20  16.482 ± 0.093  ns/op

We can see that the invocation overhead of the dynamic implementations relative to pure Java is about 9ns on this machine.

The calls benchmarked are to this method, generated using ASM, intended to mimic what we would expect a compiler to output:

public class uk.co.farowl.vsj2dy.evo4.AbstractProxy {
  public static PyObject negative(PyObject);
    Code:
       0: aload_0
       1: invokedynamic #15,  0
                    // InvokeDynamic #0:negative:(LPyObject;)LPyObject;
       6: areturn
...
}

In the disassembly the package name prefixes uk/co/farowl/vsj2/evo4/ and uk.co.farowl.vsj2.evo4. have been elided and lines broken for the sake of readability. The generated class AbstractProxy is used in the benchmarks in place of the abstract numeric API Number:

public class PyFloatUnary {

    double v = 42.0;
    PyObject vo = Py.val(v);

    @Benchmark
    @Fork(4)  // Needs a lot of iterations to resolve short times
    @Measurement(iterations = 50)
    public double nothing() { return v; }

    @Benchmark
    public PyObject neg() throws Throwable {
        return AbstractProxy.negative(vo);
    }

    @Benchmark
    @Fork(4)  // Needs a lot of iterations to resolve short times
    @Measurement(iterations = 50)
    public double neg_java() { return -v; }
    // ...

The MutableCallSite specialisation on the receiving end is straight out of the textbook in the unary case (some set-up removed):

static class UnaryOpCallSite extends MutableCallSite {
    //...

    private final Slot op;

    public UnaryOpCallSite(Slot op)
            throws NoSuchMethodException, IllegalAccessException {
        super(UOP);
        this.op = op;
        setTarget(fallbackMH.bindTo(this));
    }

    private PyObject fallback(PyObject v) throws Throwable {
        PyType vType = v.getType();
        MethodHandle resultMH, targetMH;

        if (op.isDefinedFor(vType)) {
            resultMH = op.getSlot(vType);
        } else {
            resultMH = OPERAND_ERROR.bindTo(op);
        }

        // MH for guarded invocation (becomes new target)
        MethodHandle guardMH = CLASS_GUARD.bindTo(v.getClass());
        targetMH = guardWithTest(guardMH, resultMH, getTarget());
        setTarget(targetMH);

        // Compute the result for this case
        return (PyObject) resultMH.invokeExact(v);
    }
    //...

6.1.4. VSJ 3 evo 1

VSJ 3 is the “plain Java object” implementation. There is no PyObject that all Python objects extend or implement. We associate Python types with classes through a ClassValue, that permits a BigInteger to be recognised directly as an int, for example, and a Double as a float.

As in VSJ 2, each operation of which an object is capable, is accessed through a MethodHandle stored in a data structure that describes the Python type. Since the Python type is no longer written on the object, in VSJ 3, finding the handle is less direct than in VSJ 2, and we should expect the extra work (a call to ClassValue.get()) to show in the time taken to invoke the operation.

Benchmark                Mode  Cnt   Score   Error  Units
PyFloatUnary.neg         avgt   20  31.125 ± 0.377  ns/op
PyFloatUnary.neg_java    avgt  200   5.646 ± 0.050  ns/op
PyFloatUnary.nothing     avgt  200   5.773 ± 0.070  ns/op
PyLongUnary.neg          avgt   20  26.226 ± 0.809  ns/op
PyLongUnary.neg_java     avgt  100   5.441 ± 0.052  ns/op
PyLongUnary.negbig       avgt   20  32.605 ± 0.663  ns/op
PyLongUnary.negbig_java  avgt   20  16.509 ± 0.109  ns/op

Compared with VSJ 2 evo4, the overhead for float has indeed increased to 25ns (up from around 18ns), but in fact we are doing slightly better than VSJ 2 with int. This will count when we are interpreting CPython byte code. We have no measurements (at the time of writing) to tell us whether this is important relative to the overhead of the interpreter loop.

The comparison with VSJ 2 is not quite direct, since in VSJ 3 we represent int by Integer, if the value is not too big. This saves work in PyLongUnary.neg. Its comparator PyLongUnary.neg_java is written using a primitive Java int.

6.1.5. VSJ 3 evo 1 with invokedynamic

VSJ 3 also supports binding the MethodHandles into invokedynamic call sites. The mechanism for doing so is more complex than the one we layered onto VSJ 2, but in return we create the possibility of binding versions specialised to the argument(s). For example, the call site in PyLongUnary.neg will be bound to a method with signature Object __neg__(Integer). Binding is a one-time cost (per call site and type).

Benchmark                Mode  Cnt   Score   Error  Units
PyFloatUnary.neg         avgt   20  12.590 ± 0.108  ns/op
PyFloatUnary.neg_java    avgt  200   5.511 ± 0.025  ns/op
PyFloatUnary.nothing     avgt  200   5.612 ± 0.053  ns/op
PyLongUnary.neg          avgt   20  12.913 ± 0.051  ns/op
PyLongUnary.neg_java     avgt  100   5.408 ± 0.053  ns/op
PyLongUnary.negbig       avgt   20  16.752 ± 0.341  ns/op
PyLongUnary.negbig_java  avgt   20  16.544 ± 0.117  ns/op

For float and small int the overhead is just 7ns, while for int big enough to need a BigInteger, we there seems to be no overhead at all.

6.1.6. Analysis

Basic Slot Dispatch

The plain VSJ 2 implementation dispatches through a MethodHandle in the following way:

public class Number extends Abstract { // ...
    public static PyObject negative(PyObject v) throws Throwable {
        try {
            return (PyObject) v.getType().op_neg.invokeExact(v);
        } catch (Slot.EmptyException e) {
            throw operandError("unary -", v);
        }
    }
    // ...
class PyFloat extends AbstractPyObject { // ...

    private PyObject __neg__() { return new PyFloat(-value); }

The 18ns that this dispatch costs in VSJ 2 on the test machine is not very much time: not enough to create frames for the apparent depth of call the stack. We explain this in terms of in-lining carried out by Java HotSpot.

For example, in the floating-point benchmark, we should expect Number.negative to have been in-lined at the call site, and specialised for PyFloat. At the same time, the PyFloat constructor call will have been in-lined in __neg__. The residual time probably consists of a guard (a check that v is in fact a PyFloat), and a call to TYPE.op_neg.invokeExact on the optimised handle.

In comparison, Jython 2 dispatch consists of a Java virtual method call to PyObject.__neg__, overridden by PyFloat.__neg__, which itself has essentially the same form as in the VSJ 2 implementation. This dispatch costs only about 8ns, suggesting that the virtual call is fully in-lined, and specialised to PyFloat, after a simple guard on type.

The invokeExact Barrier

Jython is significantly quicker than plain VSJ 2. It begins to look as if the called implementation of Number.negative cannot be in-lined across an invokeExact call. The equivalent path in VSJ 3 displays the same problem. Why might this be?

Inlining is not safe here because Java cannot tell that the handle stored in a PyType will not change. We cannot declare it final in PyType generally, since in some types (although not int or float) the handle will be re-written when __neg__ is defined in the called type or an ancestor. In Python this can happen at any time.

This sets a limit to what can be expected of interpreted CPython byte code.

Recovery with invokedynamic

Turning now to VSJ 2 with invokedynamic, performance recovers to equal that of Jython 2, suggesting that the JVM is successfully in-lining the method handles installed by the UnaryOpCallSite. We apply a class-guard that wraps the op_neg handle, and falls back to a method that will look for the correct handle. When the JVM specialises the in-lined call in Jython 2, it too must check the specialisation applies to each new argument. The timings tell us the checks in VSJ 2 cost no more than those in Jython 2.

By installing the handle on __neg__ as the target of the call site, the run-time system implicitly guarantees to the JVM that in-lining is safe. We have to use a mutable call site, one where the handle may change, because a new class may come along at any time and fail the class guard. Then we will re-write the target, and the JVM will respond by adapting the in-line code.

The call site we implemented in VSJ 2 with invokedynamic is incorrect in this respect. It assumes no re-definition of __neg__ will occur, which is correct for the types in the test but will not do long-term. For types that allow re-assignment of special functions (something the type object must indicate is a possibility), and for types that allow object type to be changed, a different handle should be installed that always goes via op_neg in the type the object (currently) has.

The VSJ 3 call site is designed to allow for mutable types. For immutable types like int and float it will install the fast path.

Dispatch Specific to Java Class

The other big difference in VSJ 3 from VSJ 2 is the adoption of multiple types as implementations of a built-in Python object type. Operations are defined separately for each implementing Java class, and it is these definitions that the abstract API will invoke.

public class Number extends Abstract { // ...
    public static PyObject negative(PyObject v) throws Throwable {
        try {
            return Operations.of(v).op_neg.invokeExact(v);
        } catch (Slot.EmptyException e) {
            throw operandError(Slot.op_neg, v);
        }
    }
    // ...
class PyFloatMethods { // ...

    static Object __neg__(PyFloat self) { return -self.value; }
    static Object __neg__(Double self) { return -self.doubleValue(); }

The PyFloat implementation exists so that we may sub-class float in Python. A method is defined for all implementing classes (or for a super-class), or it is not defined for the type, and the slot will then either be inherited or be empty.

When binding the target for a newly-encountered call into a call site, the site will find the definition for that class and (for immutable types) bind that directly. If the method is not defined, it will bind one that raises a Python TypeError.

It is clear in the timings that specialisation and the simpler MethodHandles do reduce the overhead as hoped, beating Jython 2 by a small margin.