..  performance/binary-operations.rst


Binary Operations
#################

Binary operations are interesting in Python
because of the way left and right operands both get a chance
to reply with the answer.
This complexity is the price we pay for dynamic typing.

As the overhead logic does not vary with the operation concerned,
there would be little point in measuring exhaustively,
all the different arithmetic operations.
Instead, we focus on a single operation (addition)
and explore the effect of varying the types of the operands
(``float`` and ``int``).

The integers in these measurements are arbitrary precision integers
(``BigInteger`` or a Python type based on it)
for comparability.


.. _benchmark-binary-vsj2:

VSJ 2 evo 4
***********

We are measuring the execution time of ``Number.add``.
We measure ``Number.multiply`` to confirm its overhead is similar.
Again we compare the API call with an equivalent calculation in pure Java,
where the type of the operand is known statically.

..  code:: none

    Benchmark                           Mode  Cnt    Score   Error  Units
    PyFloatBinary.add_float_float       avgt   20   37.583 ± 0.436  ns/op
    PyFloatBinary.add_float_float_java  avgt  200    5.884 ± 0.034  ns/op
    PyFloatBinary.add_float_int         avgt   20   49.790 ± 0.240  ns/op
    PyFloatBinary.add_float_int_java    avgt  200   12.914 ± 0.102  ns/op
    PyFloatBinary.add_int_float         avgt   20   75.128 ± 0.429  ns/op
    PyFloatBinary.add_int_float_java    avgt  200   13.183 ± 0.049  ns/op
    PyFloatBinary.mul_float_float       avgt   20   37.640 ± 0.193  ns/op
    PyFloatBinary.mul_float_float_java  avgt  200    5.835 ± 0.029  ns/op
    PyFloatBinary.nothing               avgt  200    5.518 ± 0.061  ns/op
    PyFloatBinary.quartic               avgt   20  169.479 ± 0.608  ns/op
    PyFloatBinary.quartic_java          avgt  200    6.417 ± 0.103  ns/op
    PyLongBinary.add                    avgt   20   59.843 ± 1.894  ns/op
    PyLongBinary.add_java               avgt   20   31.587 ± 1.567  ns/op
    PyLongBinary.addbig                 avgt   20   71.830 ± 0.624  ns/op
    PyLongBinary.addbig_java            avgt   20   39.237 ± 0.899  ns/op
    PyLongBinary.mul                    avgt   20   75.785 ± 0.948  ns/op
    PyLongBinary.mul_java               avgt   20   43.508 ± 0.213  ns/op
    PyLongBinary.mulbig                 avgt   20   93.269 ± 6.621  ns/op
    PyLongBinary.mulbig_java            avgt   20   56.321 ± 0.336  ns/op

The invocation overhead is 28-32ns on this machine for similar types,
added or multiplied (``int`` with ``int``, ``float`` with ``float``).
``float + int`` is slower (at 37ns) perhaps related to the need to convert
the ``int`` to a ``float``,
and ``int + float`` even slower (at 62ns) because ``int`` is consulted first,
only to return ``NotImplemented``.

The test fixture that produces this table consists of methods
that typically look like this (without the JMH annotations):

..  code:: java

    public class PyFloatBinary {

        BigInteger iv = BigInteger.valueOf(6), iw = BigInteger.valueOf(7);
        double v = 1.01 * iv.doubleValue(), w = 1.01 * iw.doubleValue();
        PyObject fvo = Py.val(v), fwo = Py.val(w);
        PyObject ivo = Py.val(iv), iwo = Py.val(iv);

        public double nothing() { return v; }
        public double add_float_int_java() { return v + iw.doubleValue(); }

        public PyObject add_float_int() throws Throwable {
            return Number.add(fvo, iwo);
        }
        // ...

        public double quartic_java() { return v * w * (v + w) * (v - w); }

        public PyObject quartic() throws Throwable {
            return Number.multiply(
                    Number.multiply(Number.multiply(fvo, fwo),
                            Number.add(fvo, fwo)),
                    Number.subtract(fvo, fwo));
        }
        // ...
    }

We throw in a ``quartic`` calculation to examine how the time scales
with complexity but no new data fetches.
This method calls for 5 floating operations.


.. _benchmark-binary-jython2:

Jython 2.7.2
************

We measure ``PyObject._add`` and ``PyObject._mul``
as these are comparable in purpose to the VSJ 2 abstract API,
and are called from compiled Python.
The integers in this measurement are Python 2 ``long`` for comparability
with Python 3 ``int``.

..  code:: none

    Benchmark                           Mode  Cnt   Score   Error  Units
    PyFloatBinary.add_float_float       avgt   20  17.086 ± 0.396  ns/op
    PyFloatBinary.add_float_float_java  avgt  200   5.719 ± 0.035  ns/op
    PyFloatBinary.add_float_int         avgt   20  36.094 ± 0.227  ns/op
    PyFloatBinary.add_float_int_java    avgt  200  12.554 ± 0.087  ns/op
    PyFloatBinary.add_int_float         avgt   20  34.508 ± 0.449  ns/op
    PyFloatBinary.add_int_float_java    avgt  200  13.114 ± 0.056  ns/op
    PyFloatBinary.mul_float_float       avgt   20  16.731 ± 0.069  ns/op
    PyFloatBinary.mul_float_float_java  avgt  200   5.930 ± 0.050  ns/op
    PyFloatBinary.nothing               avgt  200   5.530 ± 0.050  ns/op
    PyFloatBinary.quartic               avgt   20  18.168 ± 0.116  ns/op
    PyFloatBinary.quartic_java          avgt  200   6.550 ± 0.065  ns/op
    PyLongBinary.add                    avgt   20  37.721 ± 0.496  ns/op
    PyLongBinary.add_java               avgt   20  29.790 ± 1.104  ns/op
    PyLongBinary.addbig                 avgt   20  49.441 ± 0.457  ns/op
    PyLongBinary.addbig_java            avgt   20  38.964 ± 0.198  ns/op
    PyLongBinary.mul                    avgt   20  53.263 ± 0.202  ns/op
    PyLongBinary.mul_java               avgt   20  43.607 ± 0.170  ns/op
    PyLongBinary.mulbig                 avgt   20  68.540 ± 1.227  ns/op
    PyLongBinary.mulbig_java            avgt   20  55.820 ± 0.165  ns/op
    PyLongBinary.nothing                avgt   20   5.522 ± 0.119  ns/op


The invocation overhead is 10-13ns on this machine for similar types,
added or multiplied (``int`` with ``int``, ``float`` with ``float``).
For ``float + int`` and ``int + float``
the overhead is roughly double (at 24ns)
perhaps related to the need to convert the ``int`` to a ``float``,
The need to consult ``int`` first in ``int + float``
does not seem to have added any overhead.


.. _benchmark-binary-vsj2-indy:

VSJ 2 evo 4 with ``invokedynamic``
**********************************

Our benchmarks depend on specially-generated equivalents to ``Number.add``,
``Number.multiply`` and (in one place) ``Number.subtract``
that contain just an ``invokedynamic`` instruction.

..  code:: none

    Benchmark                           Mode  Cnt   Score   Error  Units
    PyFloatBinary.add_float_float       avgt   20  17.682 ± 0.371  ns/op
    PyFloatBinary.add_float_float_java  avgt   20   5.686 ± 0.057  ns/op
    PyFloatBinary.add_float_int         avgt   20  22.065 ± 0.182  ns/op
    PyFloatBinary.add_float_int_java    avgt   20  12.675 ± 0.081  ns/op
    PyFloatBinary.add_int_float         avgt   20  24.399 ± 0.743  ns/op
    PyFloatBinary.add_int_float_java    avgt   20  13.061 ± 0.126  ns/op
    PyFloatBinary.mul_float_float       avgt   20  16.066 ± 0.509  ns/op
    PyFloatBinary.mul_float_float_java  avgt   20   5.688 ± 0.073  ns/op
    PyFloatBinary.nothing               avgt   20   5.597 ± 0.126  ns/op
    PyFloatBinary.quartic               avgt   20  49.196 ± 0.211  ns/op
    PyFloatBinary.quartic_java          avgt   20   6.619 ± 0.609  ns/op
    PyLongBinary.add                    avgt   20  45.005 ± 1.198  ns/op
    PyLongBinary.add_java               avgt   20  29.224 ± 1.099  ns/op
    PyLongBinary.addbig                 avgt   20  54.124 ± 0.392  ns/op
    PyLongBinary.addbig_java            avgt   20  38.736 ± 0.238  ns/op
    PyLongBinary.mul                    avgt   20  56.878 ± 0.325  ns/op
    PyLongBinary.mul_java               avgt   20  44.448 ± 0.807  ns/op
    PyLongBinary.mulbig                 avgt   20  65.876 ± 0.267  ns/op
    PyLongBinary.mulbig_java            avgt   20  54.934 ± 0.191  ns/op

The invocation overhead is 10-15ns on this machine.
As before we see some additional cost to convert types during
``float + int`` and ``int + float``.

As in the unary case,
the call sites become specialised to invoke ``op_add``,
``op_mul`` or ``op_sub`` from the type (or types) encountered.
The call site is quite complicated compared to the unary case
because the types of both arguments must be taken into account
in the specialisation.


.. _benchmark-binary-vsj3:

VSJ 3 evo 1
***********

VSJ 3 is the "plain Java object" implementation, where
finding the handle of an operation involves work
(a call to ``ClassValue.get()``)
not necessary with the type-labelled ``PyObject``\s of VSJ 2.

..  code:: none

    27/02 18:30
    Benchmark                            Mode  Cnt    Score   Error  Units
    PyFloatBinary.add_float_float        avgt   20   54.345 ± 0.835  ns/op
    PyFloatBinary.add_float_float_java   avgt  200    5.993 ± 0.050  ns/op
    PyFloatBinary.add_float_int          avgt   20   68.574 ± 1.226  ns/op
    PyFloatBinary.add_float_int_java     avgt  200    6.851 ± 0.217  ns/op
    PyFloatBinary.add_int_float          avgt   20   94.763 ± 2.103  ns/op
    PyFloatBinary.add_int_float_java     avgt  200    6.299 ± 0.062  ns/op
    PyFloatBinary.addbig_float_int       avgt   20   76.177 ± 0.308  ns/op
    PyFloatBinary.addbig_float_int_java  avgt   20   17.232 ± 0.136  ns/op
    PyFloatBinary.addbig_int_float       avgt   20   98.819 ± 0.304  ns/op
    PyFloatBinary.addbig_int_float_java  avgt   20   20.261 ± 0.094  ns/op
    PyFloatBinary.mul_float_float        avgt   20   53.369 ± 0.275  ns/op
    PyFloatBinary.mul_float_float_java   avgt  200    5.995 ± 0.048  ns/op
    PyFloatBinary.nothing                avgt  200    5.618 ± 0.037  ns/op
    PyFloatBinary.quartic                avgt   20  257.507 ± 1.100  ns/op
    PyFloatBinary.quartic_java           avgt  200    6.934 ± 0.132  ns/op
    PyLongBinary.add                     avgt   20   60.712 ± 0.622  ns/op
    PyLongBinary.add_java                avgt   20   30.235 ± 0.968  ns/op
    PyLongBinary.addbig                  avgt   20   85.987 ± 0.794  ns/op
    PyLongBinary.addbig_java             avgt   20   38.598 ± 0.914  ns/op
    PyLongBinary.mul                     avgt   20   78.763 ± 1.996  ns/op
    PyLongBinary.mul_java                avgt   20   43.609 ± 0.212  ns/op
    PyLongBinary.mulbig                  avgt   20   96.414 ± 0.915  ns/op
    PyLongBinary.mulbig_java             avgt   20   55.932 ± 0.576  ns/op


Compared with VSJ 2 evo4,
for ``float`` the overhead has indeed increased to 50-90ns
(up from around 30ns),
but in fact we are doing about the same as VSJ 2 with ``int``.

Again the comparison with VSJ 2 is not quite direct,
since in VSJ 3 we represent ``int`` by ``Integer``,
if the value is not too big.
The ``int``\s are ``Integer`` in ``add_float_int`` and ``add_int_float``,
and primitive Java int in their Java counterparts.
The tests with ``addbig`` or ``mulbig`` in the name use ``BigInteger``.


.. _benchmark-binary-vsj3-indy:

VSJ 3 evo 1 with ``invokedynamic``
**********************************

When VSJ 3 binds the ``MethodHandle``\s
into binary ``invokedynamic`` call sites,
it uses definitions specialised to the types of both operands.

For example, the call site in ``PyFloatBinary.add_int_float``
will be bound to a method with signature ``Object __add__(Integer, Double)``,
provided by the implementation of ``float``.
The fact that ``int`` provides no method with this signature,
and this can be decided at the time the call site is being bound,
makes it unnecessary to consult ``int`` during the call itself.
(This only applies to types where the definition cannot change.)

..  code:: none

    Benchmark                            Mode  Cnt   Score   Error  Units
    PyFloatBinary.add_float_float        avgt   20  12.522 ± 0.112  ns/op
    PyFloatBinary.add_float_float_java   avgt   20   6.026 ± 0.148  ns/op
    PyFloatBinary.add_float_int          avgt   20  16.439 ± 0.144  ns/op
    PyFloatBinary.add_float_int_java     avgt   20   6.404 ± 0.280  ns/op
    PyFloatBinary.add_int_float          avgt   20  15.417 ± 0.047  ns/op
    PyFloatBinary.add_int_float_java     avgt   20   6.563 ± 0.285  ns/op
    PyFloatBinary.addbig_float_int       avgt   20  24.067 ± 0.245  ns/op
    PyFloatBinary.addbig_float_int_java  avgt   20  16.972 ± 0.072  ns/op
    PyFloatBinary.addbig_int_float       avgt   20  23.798 ± 0.128  ns/op
    PyFloatBinary.addbig_int_float_java  avgt   20  20.342 ± 0.108  ns/op
    PyFloatBinary.mul_float_float        avgt   20  12.604 ± 0.062  ns/op
    PyFloatBinary.mul_float_float_java   avgt   20   6.106 ± 0.268  ns/op
    PyFloatBinary.nothing                avgt   20   5.741 ± 0.080  ns/op
    PyFloatBinary.quartic                avgt   20  12.925 ± 0.123  ns/op
    PyFloatBinary.quartic_java           avgt   20   6.746 ± 0.585  ns/op
    PyLongBinary.add                     avgt   20  15.204 ± 0.172  ns/op
    PyLongBinary.add_java                avgt   20   5.181 ± 0.173  ns/op
    PyLongBinary.addbig                  avgt   20  43.007 ± 0.388  ns/op
    PyLongBinary.addbig_java             avgt   20  38.462 ± 0.245  ns/op
    PyLongBinary.mul                     avgt   20  15.770 ± 0.099  ns/op
    PyLongBinary.mul_java                avgt   20   6.080 ± 0.190  ns/op
    PyLongBinary.mulbig                  avgt   20  61.291 ± 0.453  ns/op
    PyLongBinary.mulbig_java             avgt   20  56.014 ± 0.636  ns/op


Analysis
********

Binary Slot Dispatch
====================

The semantics of binary operations in Python are complex,
in particular the way in which the types of both arguments
must be consulted,
and types give way to sub-types.
This is the reason why we explore the combinations
``float+float``, ``float+int`` and ``int+float`` separately.
The last of these has the largest overhead,
since the ``int.__add__`` must be consulted and return ``NotImplemented``,
before ``float.__add__`` comes up with the answer.

Even in the case ``float+int`` where ``float.__add__`` will succeed,
code that is ignorant of the particular types
must check that ``int`` is not a sub-type of it,
then apply a test for ``NotImplemented``.
This happens for every addition,
compared with Java in which the types are statically known,
and only a handful of instructions need be executed.


Greater Strain on In-lining
===========================

In the binary performance tests,
we again see that Jython 2 is faster than VSJ 2
and VSJ 3 using ``invokeExact``,
supporting the hypothesis that the virtual method calls
are more successfully in-lined.
(See :ref:`benchmark-invoke-barrier`.)

In Jython 2 ``float`` tests,
the difference made by having ``int`` on the left,
and returning ``NotImplemented`` each time is not pronounced.
The assembly code generated by the JVM is long and complex,
but it appears that having in-lined the body of ``PyLong.__add__``,
the compiler can see that ``NotImplemented`` is the inevitable result,
and goes directly to ``PyFloat.__radd__``.

In VSJ 2, we see quite a big penalty for having ``int`` on the left.
This deficit is only partly made up in VSJ 2 with ``invokedynamic``.
We speculate that the method handles are only partially in-lined
because they are too deeply nested for the JVM to do so fully.
The wrappers that test for ``NotImplemented`` contribute to that depth,
in addition to the class guards.

The shortcoming we noted in VSJ 2 unary ``invokedynamic`` call sites,
that they assume no re-definition of the operations may occur,
is (doubly) present in the binary case,
but we correct that in VSJ 3.


Dispatch Specific to Java Class
===============================

We saw a small advantage in VSJ 3 from
:ref:`benchmark-unary-class-specific-dispatch` in the unary case.
In the binary case,
we gain enormously from specialisation in a mixed case like ``int+float``.
After a guard on class,
the course of events may be plotted completely by the call site:
it is not necessary (in the case cited) to consult ``int``,
or to test for ``NotImplemented`` in the handle.

To pull this off,
we have to supply quite a number of ``static`` implementation methods.
Each implementation class of ``float`` must be supported
as the first argument of ``__add__`` and ``__radd__``,
then combined with each legitimate other operand class for the second.

..  code:: java

    class PyFloatBinops { // ...
        static Object __add__(PyFloat v, PyFloat w) {
            return v.value + w.value;
        }
        static Object __add__(PyFloat v, Double w) {
            return v.value + w.doubleValue();
        }
        static Object __add__(PyFloat v, Integer w) {
            return v.value + w.doubleValue();
        }
        static Object __add__(PyFloat v, BigInteger w) {
            return v.value + PyLong.convertToDouble(w);
        }
        // ...
        static Object __add__(Double v, PyFloat w) {
            return v.doubleValue() + w.value;
        }
        // ...
        static Object __radd__(PyFloat w, PyFloat v) {
            return v.value + w.value;
        }
        // ...

There are many combinations and we use a script to generate them.
The call site binds in the method for each operand class pair,
under a double class guard.


Thoughts on the quartic test
============================

The test ``quartic`` provides a surprise or two.
This method asks for 5 floating operations
(3 multiplications, an addition and a subtraction).
We have seen that the time for each is not greatly different in isolation.

The pure Java version ``quartic_java`` is noteworthy
for taking barely a nanosecond longer than a single addition.
This is discussed in :ref:`benchmark-vanishing-time`.
The pipelining and concurrency evident in the result
is possible when the floating-point operations are adjacent
(part of the same expression, say).

Jython 2 also achieves a time for ``quartic``
roughly the same as its own ``add_float_float``,
suggesting the residual overhead (probably two type checks) is paid only once
and the in-lined code optimised as well as for the native case.

In VSJ 2, the overhead of the ``quartic`` is basically 5 times
that of ``add_float_float``,
showing that there is no in-lining of the separate calls
that would bring the floating point calculation together in one place.

This is also approximately true of VSJ 2 with ``invokedynamic``:
the overhead relative to pure Java is 43ns,
around 3.5 times the 12ns on ``add_float_float`` .
Evidently we do not get the remarkable concurrency seen in ``quartic``
for Jython 2 and pure Java.

Finally in VSJ 3,
the specialisation to class allows the handles to in-line fully,
and we are down to 7ns total overhead
relative to the pure Java expression.