..  performance/unary-operations.rst


Unary Operations
################

Here we examine times for basic unary operations on ``float`` and ``int``.
The integers in these measurements are arbitrary precision integers
(``BigInteger`` or a Python type based on it)
for comparability.


.. _benchmark-unary-vsj2:

VSJ 2 evo 4
***********

The measurements are of the abstract API
that supports the Python byte code interpreter.
If there were a compiler for VSJ 2,
it would probably emit calls to these methods.

In this test we are measuring the execution time of ``Number.negative``,
and comparing it with the an equivalent calculation in pure Java,
where the type of the operand is known statically.
This will tell us how much we are paying for the dynamic typing in Python,
in a particular implementation.

..  code:: none

    Benchmark                Mode  Cnt   Score   Error  Units
    PyFloatUnary.neg         avgt   20  23.966 ± 0.306  ns/op
    PyFloatUnary.neg_java    avgt  200   5.616 ± 0.041  ns/op
    PyFloatUnary.nothing     avgt  200   5.650 ± 0.053  ns/op
    PyLongUnary.neg          avgt   20  34.552 ± 0.264  ns/op
    PyLongUnary.neg_java     avgt  100  16.676 ± 0.098  ns/op
    PyLongUnary.negbig       avgt   20  35.507 ± 0.636  ns/op
    PyLongUnary.negbig_java  avgt   20  16.537 ± 0.088  ns/op

The invocation overhead is consistently around 18ns on this machine,
over the basic cost of the pure Java operation.
The test fixture code looks (typically) like this:

..  code:: java

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)

    @Fork(2)
    @Warmup(iterations = 20, time = 1, timeUnit = TimeUnit.SECONDS)
    @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)

    @State(Scope.Thread)
    public class PyFloatUnary {

        double v = 42.0;
        PyObject vo = Py.val(v);

        @Benchmark
        @Fork(4)  // Needs a lot of iterations to resolve short times
        @Measurement(iterations = 50)
        public double nothing() { return v; }

        @Benchmark
        @Fork(4)  // Needs a lot of iterations to resolve short times
        @Measurement(iterations = 50)
        public double neg_java() { return -v; }

        @Benchmark
        public PyObject neg() throws Throwable {
            return Number.negative(vo);
        }
        // ...

We have a test method on the API call,
and a nearest equivalent Java method (suffix ``_java``).
The name is chosen to make them adjacent in the results table.

A result must always be returned that depends on all the code under test,
so that JMH can convince Java that the result will be used elsewhere,
and that it cannot be optimised away to nothing.

We also measured a ``nothing`` (empty body) method
as a check on the overheads of simply calling a method under test.
Notice that the costs of ``PyFloatUnary.neg_java`` and of ``nothing``
are not significantly different.
If there is a difference, it is less than one clock cycle of the CPU.
We give an explanation for this in the section
:ref:`benchmark-vanishing-time`.


.. _benchmark-unary-jython2:

Jython 2.7.2
************

We measure ``PyObject.__neg__``
as this is comparable in purpose to the VSJ 2 abstract API,
and is called from compiled Python.
The integer arguments in this measurement are Python 2 ``long``
for comparability with the Python 3 ``int``.
They are implemented as ``BigInteger``.

..  code:: none

    Benchmark                Mode  Cnt   Score   Error  Units
    PyFloatUnary.neg         avgt   20  15.095 ± 0.090  ns/op
    PyFloatUnary.neg_java    avgt  200   5.400 ± 0.039  ns/op
    PyFloatUnary.nothing     avgt  200   5.653 ± 0.058  ns/op
    PyLongUnary.neg          avgt   20  24.859 ± 0.214  ns/op
    PyLongUnary.neg_java     avgt   20  16.601 ± 0.144  ns/op
    PyLongUnary.negbig       avgt   20  24.725 ± 0.142  ns/op
    PyLongUnary.negbig_java  avgt   20  16.640 ± 0.086  ns/op

We can see that the invocation overhead of the Jython 2 approach
is 8-9ns on this machine.


.. _benchmark-unary-vsj2-indy:

VSJ 2 evo 4 with ``invokedynamic``
**********************************

We measure a specially-generated equivalent to ``Number.negative``,
that contains just an ``invokedynamic`` instruction,
and will become linked to a mutable call site at run time.
We do not yet have a compiler for Python that would generate that code,
but this allows us to benchmark the fragment we expect one to emit.

The call site becomes specialised to invoke ``Slot.op_neg``
from the type (or types) encountered,
and therefore we call the same VSJ 2 implementation of ``__neg__``
that was engaged in the plain VSJ 2 benchmark.
Only the linkage and call mechanisms are different.
In particular, Python ``int`` is still implemented using ``BigInteger``.

..  code:: none

    Benchmark                Mode  Cnt   Score   Error  Units
    PyFloatUnary.neg         avgt   20  14.273 ± 0.161  ns/op
    PyFloatUnary.neg_java    avgt  200   5.492 ± 0.031  ns/op
    PyFloatUnary.nothing     avgt  200   5.562 ± 0.064  ns/op
    PyLongUnary.neg          avgt   20  25.405 ± 0.555  ns/op
    PyLongUnary.neg_java     avgt  100  16.691 ± 0.108  ns/op
    PyLongUnary.negbig       avgt   20  24.763 ± 0.489  ns/op
    PyLongUnary.negbig_java  avgt   20  16.482 ± 0.093  ns/op

We can see that the invocation overhead of the dynamic implementations
relative to pure Java is about 9ns on this machine.

The calls benchmarked are to this method, generated using ASM,
intended to mimic what we would expect a compiler to output:

..  code:: none

    public class uk.co.farowl.vsj2dy.evo4.AbstractProxy {
      public static PyObject negative(PyObject);
        Code:
           0: aload_0
           1: invokedynamic #15,  0
                        // InvokeDynamic #0:negative:(LPyObject;)LPyObject;
           6: areturn
    ...
    }

In the disassembly the package name prefixes ``uk/co/farowl/vsj2/evo4/``
and ``uk.co.farowl.vsj2.evo4.``
have been elided and lines broken for the sake of readability.
The generated class AbstractProxy is used in the benchmarks
in place of the abstract numeric API ``Number``:

..  code:: java

    public class PyFloatUnary {

        double v = 42.0;
        PyObject vo = Py.val(v);

        @Benchmark
        @Fork(4)  // Needs a lot of iterations to resolve short times
        @Measurement(iterations = 50)
        public double nothing() { return v; }

        @Benchmark
        public PyObject neg() throws Throwable {
            return AbstractProxy.negative(vo);
        }

        @Benchmark
        @Fork(4)  // Needs a lot of iterations to resolve short times
        @Measurement(iterations = 50)
        public double neg_java() { return -v; }
        // ...


The ``MutableCallSite`` specialisation on the receiving end
is straight out of the textbook in the unary case (some set-up removed):

..  code:: java

    static class UnaryOpCallSite extends MutableCallSite {
        //...

        private final Slot op;

        public UnaryOpCallSite(Slot op)
                throws NoSuchMethodException, IllegalAccessException {
            super(UOP);
            this.op = op;
            setTarget(fallbackMH.bindTo(this));
        }

        private PyObject fallback(PyObject v) throws Throwable {
            PyType vType = v.getType();
            MethodHandle resultMH, targetMH;

            if (op.isDefinedFor(vType)) {
                resultMH = op.getSlot(vType);
            } else {
                resultMH = OPERAND_ERROR.bindTo(op);
            }

            // MH for guarded invocation (becomes new target)
            MethodHandle guardMH = CLASS_GUARD.bindTo(v.getClass());
            targetMH = guardWithTest(guardMH, resultMH, getTarget());
            setTarget(targetMH);

            // Compute the result for this case
            return (PyObject) resultMH.invokeExact(v);
        }
        //...


.. _benchmark-unary-vsj3:

VSJ 3 evo 1
***********

VSJ 3 is the "plain Java object" implementation.
There is no ``PyObject`` that all Python objects extend or implement.
We associate Python types with classes through a ``ClassValue``,
that permits a ``BigInteger`` to be recognised directly as an ``int``,
for example,
and a ``Double`` as a ``float``.

As in VSJ 2,
each operation of which an object is capable,
is accessed through a ``MethodHandle``
stored in a data structure that describes the Python type.
Since the Python type is no longer written on the object, in VSJ 3,
finding the handle is less direct than in VSJ 2,
and we should expect the extra work
(a call to ``ClassValue.get()``)
to show in the time taken to invoke the operation.

..  code:: none

    Benchmark                Mode  Cnt   Score   Error  Units
    PyFloatUnary.neg         avgt   20  31.125 ± 0.377  ns/op
    PyFloatUnary.neg_java    avgt  200   5.646 ± 0.050  ns/op
    PyFloatUnary.nothing     avgt  200   5.773 ± 0.070  ns/op
    PyLongUnary.neg          avgt   20  26.226 ± 0.809  ns/op
    PyLongUnary.neg_java     avgt  100   5.441 ± 0.052  ns/op
    PyLongUnary.negbig       avgt   20  32.605 ± 0.663  ns/op
    PyLongUnary.negbig_java  avgt   20  16.509 ± 0.109  ns/op

Compared with VSJ 2 evo4,
the overhead for ``float`` has indeed increased to 25ns (up from around 18ns),
but in fact we are doing slightly better than VSJ 2 with ``int``.
This will count when we are interpreting CPython byte code.
We have no measurements (at the time of writing)
to tell us whether this is important
relative to the overhead of the interpreter loop.

The comparison with VSJ 2 is not quite direct,
since in VSJ 3 we represent ``int`` by ``Integer``,
if the value is not too big.
This saves work in ``PyLongUnary.neg``.
Its comparator ``PyLongUnary.neg_java``
is written using a primitive Java ``int``.


.. _benchmark-unary-vsj3-indy:

VSJ 3 evo 1 with ``invokedynamic``
**********************************

VSJ 3 also supports binding the ``MethodHandle``\s
into ``invokedynamic`` call sites.
The mechanism for doing so is more complex
than the one we layered onto VSJ 2,
but in return we create the possibility of binding versions
specialised to the argument(s).
For example, the call site in ``PyLongUnary.neg``
will be bound to a method with signature ``Object __neg__(Integer)``.
Binding is a one-time cost (per call site and type).

..  code:: none

    Benchmark                Mode  Cnt   Score   Error  Units
    PyFloatUnary.neg         avgt   20  12.590 ± 0.108  ns/op
    PyFloatUnary.neg_java    avgt  200   5.511 ± 0.025  ns/op
    PyFloatUnary.nothing     avgt  200   5.612 ± 0.053  ns/op
    PyLongUnary.neg          avgt   20  12.913 ± 0.051  ns/op
    PyLongUnary.neg_java     avgt  100   5.408 ± 0.053  ns/op
    PyLongUnary.negbig       avgt   20  16.752 ± 0.341  ns/op
    PyLongUnary.negbig_java  avgt   20  16.544 ± 0.117  ns/op

For ``float`` and small ``int`` the overhead is just 7ns,
while for ``int`` big enough to need a ``BigInteger``,
we there seems to be no overhead at all.


Analysis
********

Basic Slot Dispatch
===================

The plain VSJ 2 implementation dispatches through a ``MethodHandle``
in the following way:

..  code:: java

    public class Number extends Abstract { // ...
        public static PyObject negative(PyObject v) throws Throwable {
            try {
                return (PyObject) v.getType().op_neg.invokeExact(v);
            } catch (Slot.EmptyException e) {
                throw operandError("unary -", v);
            }
        }
        // ...

..  code:: java

    class PyFloat extends AbstractPyObject { // ...

        private PyObject __neg__() { return new PyFloat(-value); }

The 18ns that this dispatch costs in VSJ 2 on the test machine
is not very much time:
not enough to create frames for the apparent depth of call the stack.
We explain this in terms of in-lining carried out by Java HotSpot.

For example, in the floating-point benchmark,
we should expect ``Number.negative`` to have been in-lined at the call site,
and specialised for ``PyFloat``.
At the same time, the ``PyFloat`` constructor call will have been
in-lined in ``__neg__``.
The residual time probably consists of a guard
(a check that ``v`` is in fact a ``PyFloat``),
and a call to ``TYPE.op_neg.invokeExact`` on the optimised handle.

In comparison, Jython 2 dispatch consists of a Java virtual method call
to ``PyObject.__neg__``,
overridden by ``PyFloat.__neg__``,
which itself has essentially the same form as in the VSJ 2 implementation.
This dispatch costs only about 8ns,
suggesting that the virtual call is fully in-lined,
and specialised to ``PyFloat``,
after a simple guard on type.


.. _benchmark-invoke-barrier:

The ``invokeExact`` Barrier
===========================

Jython is significantly quicker than plain VSJ 2.
It begins to look as if the called implementation of ``Number.negative``
cannot be in-lined across an ``invokeExact`` call.
The equivalent path in VSJ 3 displays the same problem.
Why might this be?

Inlining is not safe here because
Java cannot tell that the handle stored in a ``PyType`` will not change.
We cannot declare it ``final`` in ``PyType`` generally,
since in some types (although not ``int`` or ``float``)
the handle will be re-written
when ``__neg__`` is defined in the called type or an ancestor.
In Python this can happen at any time.

This sets a limit to what can be expected of interpreted CPython byte code.


Recovery with ``invokedynamic``
===============================

Turning now to VSJ 2 with ``invokedynamic``,
performance recovers to equal that of Jython 2,
suggesting that the JVM is successfully in-lining the method handles
installed by the ``UnaryOpCallSite``.
We apply a class-guard that wraps the ``op_neg`` handle,
and falls back to a method that will look for the correct handle.
When the JVM specialises the in-lined call in Jython 2,
it too must check the specialisation applies to each new argument.
The timings tell us the checks in VSJ 2 cost no more than those in Jython 2.

By installing the handle on ``__neg__`` as the target of the call site,
the run-time system implicitly guarantees to the JVM that in-lining is safe.
We have to use a mutable call site,
one where the handle may change,
because a new class may come along at any time and fail the class guard.
Then we will re-write the target,
and the JVM will respond by adapting the in-line code.

The call site we implemented in VSJ 2 with ``invokedynamic``
is incorrect in this respect.
It assumes no re-definition of ``__neg__`` will occur,
which is correct for the types in the test but will not do long-term.
For types that allow re-assignment of special functions
(something the type object must indicate is a possibility),
and for types that allow object type to be changed,
a different handle should be installed
that always goes via ``op_neg`` in the type the object (currently) has.

The VSJ 3 call site is designed to allow for mutable types.
For immutable types like ``int`` and ``float``
it will install the fast path.

.. _benchmark-unary-class-specific-dispatch:

Dispatch Specific to Java Class
===============================

The other big difference in VSJ 3 from VSJ 2 is
the adoption of multiple types as implementations of
a built-in Python object type.
Operations are defined separately for each implementing Java class,
and it is these definitions that the abstract API will invoke.

..  code:: java

    public class Number extends Abstract { // ...
        public static PyObject negative(PyObject v) throws Throwable {
            try {
                return Operations.of(v).op_neg.invokeExact(v);
            } catch (Slot.EmptyException e) {
                throw operandError(Slot.op_neg, v);
            }
        }
        // ...

..  code:: java

    class PyFloatMethods { // ...

        static Object __neg__(PyFloat self) { return -self.value; }
        static Object __neg__(Double self) { return -self.doubleValue(); }

The ``PyFloat`` implementation exists so that
we may sub-class ``float`` in Python.
A method is defined for all implementing classes (or for a super-class),
or it is not defined for the type,
and the slot will then either be inherited or be empty.

When binding the target for a newly-encountered call into a call site,
the site will find the definition for that class
and (for immutable types) bind that directly.
If the method is not defined,
it will bind one that raises a Python ``TypeError``.

It is clear in the timings that specialisation and
the simpler ``MethodHandle``\s
do reduce the overhead as hoped,
beating Jython 2 by a small margin.