6.1. Unary Operations¶
Here we examine times for basic unary operations on float
and int
.
The integers in these measurements are arbitrary precision integers
(BigInteger
or a Python type based on it)
for comparability.
6.1.1. VSJ 2 evo 4¶
The measurements are of the abstract API that supports the Python byte code interpreter. If there were a compiler for VSJ 2, it would probably emit calls to these methods.
In this test we are measuring the execution time of Number.negative
,
and comparing it with the an equivalent calculation in pure Java,
where the type of the operand is known statically.
This will tell us how much we are paying for the dynamic typing in Python,
in a particular implementation.
Benchmark Mode Cnt Score Error Units
PyFloatUnary.neg avgt 20 23.966 ± 0.306 ns/op
PyFloatUnary.neg_java avgt 200 5.616 ± 0.041 ns/op
PyFloatUnary.nothing avgt 200 5.650 ± 0.053 ns/op
PyLongUnary.neg avgt 20 34.552 ± 0.264 ns/op
PyLongUnary.neg_java avgt 100 16.676 ± 0.098 ns/op
PyLongUnary.negbig avgt 20 35.507 ± 0.636 ns/op
PyLongUnary.negbig_java avgt 20 16.537 ± 0.088 ns/op
The invocation overhead is consistently around 18ns on this machine, over the basic cost of the pure Java operation. The test fixture code looks (typically) like this:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(2)
@Warmup(iterations = 20, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@State(Scope.Thread)
public class PyFloatUnary {
double v = 42.0;
PyObject vo = Py.val(v);
@Benchmark
@Fork(4) // Needs a lot of iterations to resolve short times
@Measurement(iterations = 50)
public double nothing() { return v; }
@Benchmark
@Fork(4) // Needs a lot of iterations to resolve short times
@Measurement(iterations = 50)
public double neg_java() { return -v; }
@Benchmark
public PyObject neg() throws Throwable {
return Number.negative(vo);
}
// ...
We have a test method on the API call,
and a nearest equivalent Java method (suffix _java
).
The name is chosen to make them adjacent in the results table.
A result must always be returned that depends on all the code under test, so that JMH can convince Java that the result will be used elsewhere, and that it cannot be optimised away to nothing.
We also measured a nothing
(empty body) method
as a check on the overheads of simply calling a method under test.
Notice that the costs of PyFloatUnary.neg_java
and of nothing
are not significantly different.
If there is a difference, it is less than one clock cycle of the CPU.
We give an explanation for this in the section
Vanishing Time in Primitive Operations.
6.1.2. Jython 2.7.2¶
We measure PyObject.__neg__
as this is comparable in purpose to the VSJ 2 abstract API,
and is called from compiled Python.
The integer arguments in this measurement are Python 2 long
for comparability with the Python 3 int
.
They are implemented as BigInteger
.
Benchmark Mode Cnt Score Error Units
PyFloatUnary.neg avgt 20 15.095 ± 0.090 ns/op
PyFloatUnary.neg_java avgt 200 5.400 ± 0.039 ns/op
PyFloatUnary.nothing avgt 200 5.653 ± 0.058 ns/op
PyLongUnary.neg avgt 20 24.859 ± 0.214 ns/op
PyLongUnary.neg_java avgt 20 16.601 ± 0.144 ns/op
PyLongUnary.negbig avgt 20 24.725 ± 0.142 ns/op
PyLongUnary.negbig_java avgt 20 16.640 ± 0.086 ns/op
We can see that the invocation overhead of the Jython 2 approach is 8-9ns on this machine.
6.1.3. VSJ 2 evo 4 with invokedynamic
¶
We measure a specially-generated equivalent to Number.negative
,
that contains just an invokedynamic
instruction,
and will become linked to a mutable call site at run time.
We do not yet have a compiler for Python that would generate that code,
but this allows us to benchmark the fragment we expect one to emit.
The call site becomes specialised to invoke Slot.op_neg
from the type (or types) encountered,
and therefore we call the same VSJ 2 implementation of __neg__
that was engaged in the plain VSJ 2 benchmark.
Only the linkage and call mechanisms are different.
In particular, Python int
is still implemented using BigInteger
.
Benchmark Mode Cnt Score Error Units
PyFloatUnary.neg avgt 20 14.273 ± 0.161 ns/op
PyFloatUnary.neg_java avgt 200 5.492 ± 0.031 ns/op
PyFloatUnary.nothing avgt 200 5.562 ± 0.064 ns/op
PyLongUnary.neg avgt 20 25.405 ± 0.555 ns/op
PyLongUnary.neg_java avgt 100 16.691 ± 0.108 ns/op
PyLongUnary.negbig avgt 20 24.763 ± 0.489 ns/op
PyLongUnary.negbig_java avgt 20 16.482 ± 0.093 ns/op
We can see that the invocation overhead of the dynamic implementations relative to pure Java is about 9ns on this machine.
The calls benchmarked are to this method, generated using ASM, intended to mimic what we would expect a compiler to output:
public class uk.co.farowl.vsj2dy.evo4.AbstractProxy {
public static PyObject negative(PyObject);
Code:
0: aload_0
1: invokedynamic #15, 0
// InvokeDynamic #0:negative:(LPyObject;)LPyObject;
6: areturn
...
}
In the disassembly the package name prefixes uk/co/farowl/vsj2/evo4/
and uk.co.farowl.vsj2.evo4.
have been elided and lines broken for the sake of readability.
The generated class AbstractProxy is used in the benchmarks
in place of the abstract numeric API Number
:
public class PyFloatUnary {
double v = 42.0;
PyObject vo = Py.val(v);
@Benchmark
@Fork(4) // Needs a lot of iterations to resolve short times
@Measurement(iterations = 50)
public double nothing() { return v; }
@Benchmark
public PyObject neg() throws Throwable {
return AbstractProxy.negative(vo);
}
@Benchmark
@Fork(4) // Needs a lot of iterations to resolve short times
@Measurement(iterations = 50)
public double neg_java() { return -v; }
// ...
The MutableCallSite
specialisation on the receiving end
is straight out of the textbook in the unary case (some set-up removed):
static class UnaryOpCallSite extends MutableCallSite {
//...
private final Slot op;
public UnaryOpCallSite(Slot op)
throws NoSuchMethodException, IllegalAccessException {
super(UOP);
this.op = op;
setTarget(fallbackMH.bindTo(this));
}
private PyObject fallback(PyObject v) throws Throwable {
PyType vType = v.getType();
MethodHandle resultMH, targetMH;
if (op.isDefinedFor(vType)) {
resultMH = op.getSlot(vType);
} else {
resultMH = OPERAND_ERROR.bindTo(op);
}
// MH for guarded invocation (becomes new target)
MethodHandle guardMH = CLASS_GUARD.bindTo(v.getClass());
targetMH = guardWithTest(guardMH, resultMH, getTarget());
setTarget(targetMH);
// Compute the result for this case
return (PyObject) resultMH.invokeExact(v);
}
//...
6.1.4. VSJ 3 evo 1¶
VSJ 3 is the “plain Java object” implementation.
There is no PyObject
that all Python objects extend or implement.
We associate Python types with classes through a ClassValue
,
that permits a BigInteger
to be recognised directly as an int
,
for example,
and a Double
as a float
.
As in VSJ 2,
each operation of which an object is capable,
is accessed through a MethodHandle
stored in a data structure that describes the Python type.
Since the Python type is no longer written on the object, in VSJ 3,
finding the handle is less direct than in VSJ 2,
and we should expect the extra work
(a call to ClassValue.get()
)
to show in the time taken to invoke the operation.
Benchmark Mode Cnt Score Error Units
PyFloatUnary.neg avgt 20 31.125 ± 0.377 ns/op
PyFloatUnary.neg_java avgt 200 5.646 ± 0.050 ns/op
PyFloatUnary.nothing avgt 200 5.773 ± 0.070 ns/op
PyLongUnary.neg avgt 20 26.226 ± 0.809 ns/op
PyLongUnary.neg_java avgt 100 5.441 ± 0.052 ns/op
PyLongUnary.negbig avgt 20 32.605 ± 0.663 ns/op
PyLongUnary.negbig_java avgt 20 16.509 ± 0.109 ns/op
Compared with VSJ 2 evo4,
the overhead for float
has indeed increased to 25ns (up from around 18ns),
but in fact we are doing slightly better than VSJ 2 with int
.
This will count when we are interpreting CPython byte code.
We have no measurements (at the time of writing)
to tell us whether this is important
relative to the overhead of the interpreter loop.
The comparison with VSJ 2 is not quite direct,
since in VSJ 3 we represent int
by Integer
,
if the value is not too big.
This saves work in PyLongUnary.neg
.
Its comparator PyLongUnary.neg_java
is written using a primitive Java int
.
6.1.5. VSJ 3 evo 1 with invokedynamic
¶
VSJ 3 also supports binding the MethodHandle
s
into invokedynamic
call sites.
The mechanism for doing so is more complex
than the one we layered onto VSJ 2,
but in return we create the possibility of binding versions
specialised to the argument(s).
For example, the call site in PyLongUnary.neg
will be bound to a method with signature Object __neg__(Integer)
.
Binding is a one-time cost (per call site and type).
Benchmark Mode Cnt Score Error Units
PyFloatUnary.neg avgt 20 12.590 ± 0.108 ns/op
PyFloatUnary.neg_java avgt 200 5.511 ± 0.025 ns/op
PyFloatUnary.nothing avgt 200 5.612 ± 0.053 ns/op
PyLongUnary.neg avgt 20 12.913 ± 0.051 ns/op
PyLongUnary.neg_java avgt 100 5.408 ± 0.053 ns/op
PyLongUnary.negbig avgt 20 16.752 ± 0.341 ns/op
PyLongUnary.negbig_java avgt 20 16.544 ± 0.117 ns/op
For float
and small int
the overhead is just 7ns,
while for int
big enough to need a BigInteger
,
we there seems to be no overhead at all.
6.1.6. Analysis¶
Basic Slot Dispatch¶
The plain VSJ 2 implementation dispatches through a MethodHandle
in the following way:
public class Number extends Abstract { // ...
public static PyObject negative(PyObject v) throws Throwable {
try {
return (PyObject) v.getType().op_neg.invokeExact(v);
} catch (Slot.EmptyException e) {
throw operandError("unary -", v);
}
}
// ...
class PyFloat extends AbstractPyObject { // ...
private PyObject __neg__() { return new PyFloat(-value); }
The 18ns that this dispatch costs in VSJ 2 on the test machine is not very much time: not enough to create frames for the apparent depth of call the stack. We explain this in terms of in-lining carried out by Java HotSpot.
For example, in the floating-point benchmark,
we should expect Number.negative
to have been in-lined at the call site,
and specialised for PyFloat
.
At the same time, the PyFloat
constructor call will have been
in-lined in __neg__
.
The residual time probably consists of a guard
(a check that v
is in fact a PyFloat
),
and a call to TYPE.op_neg.invokeExact
on the optimised handle.
In comparison, Jython 2 dispatch consists of a Java virtual method call
to PyObject.__neg__
,
overridden by PyFloat.__neg__
,
which itself has essentially the same form as in the VSJ 2 implementation.
This dispatch costs only about 8ns,
suggesting that the virtual call is fully in-lined,
and specialised to PyFloat
,
after a simple guard on type.
The invokeExact
Barrier¶
Jython is significantly quicker than plain VSJ 2.
It begins to look as if the called implementation of Number.negative
cannot be in-lined across an invokeExact
call.
The equivalent path in VSJ 3 displays the same problem.
Why might this be?
Inlining is not safe here because
Java cannot tell that the handle stored in a PyType
will not change.
We cannot declare it final
in PyType
generally,
since in some types (although not int
or float
)
the handle will be re-written
when __neg__
is defined in the called type or an ancestor.
In Python this can happen at any time.
This sets a limit to what can be expected of interpreted CPython byte code.
Recovery with invokedynamic
¶
Turning now to VSJ 2 with invokedynamic
,
performance recovers to equal that of Jython 2,
suggesting that the JVM is successfully in-lining the method handles
installed by the UnaryOpCallSite
.
We apply a class-guard that wraps the op_neg
handle,
and falls back to a method that will look for the correct handle.
When the JVM specialises the in-lined call in Jython 2,
it too must check the specialisation applies to each new argument.
The timings tell us the checks in VSJ 2 cost no more than those in Jython 2.
By installing the handle on __neg__
as the target of the call site,
the run-time system implicitly guarantees to the JVM that in-lining is safe.
We have to use a mutable call site,
one where the handle may change,
because a new class may come along at any time and fail the class guard.
Then we will re-write the target,
and the JVM will respond by adapting the in-line code.
The call site we implemented in VSJ 2 with invokedynamic
is incorrect in this respect.
It assumes no re-definition of __neg__
will occur,
which is correct for the types in the test but will not do long-term.
For types that allow re-assignment of special functions
(something the type object must indicate is a possibility),
and for types that allow object type to be changed,
a different handle should be installed
that always goes via op_neg
in the type the object (currently) has.
The VSJ 3 call site is designed to allow for mutable types.
For immutable types like int
and float
it will install the fast path.
Dispatch Specific to Java Class¶
The other big difference in VSJ 3 from VSJ 2 is the adoption of multiple types as implementations of a built-in Python object type. Operations are defined separately for each implementing Java class, and it is these definitions that the abstract API will invoke.
public class Number extends Abstract { // ...
public static PyObject negative(PyObject v) throws Throwable {
try {
return Operations.of(v).op_neg.invokeExact(v);
} catch (Slot.EmptyException e) {
throw operandError(Slot.op_neg, v);
}
}
// ...
class PyFloatMethods { // ...
static Object __neg__(PyFloat self) { return -self.value; }
static Object __neg__(Double self) { return -self.doubleValue(); }
The PyFloat
implementation exists so that
we may sub-class float
in Python.
A method is defined for all implementing classes (or for a super-class),
or it is not defined for the type,
and the slot will then either be inherited or be empty.
When binding the target for a newly-encountered call into a call site,
the site will find the definition for that class
and (for immutable types) bind that directly.
If the method is not defined,
it will bind one that raises a Python TypeError
.
It is clear in the timings that specialisation and
the simpler MethodHandle
s
do reduce the overhead as hoped,
beating Jython 2 by a small margin.