6.2. Binary Operations

Binary operations are interesting in Python because of the way left and right operands both get a chance to reply with the answer. This complexity is the price we pay for dynamic typing.

As the overhead logic does not vary with the operation concerned, there would be little point in measuring all the different arithmetic operations exhaustively. Instead, we focus on a single operation (addition) and explore the effect of varying the types of the operands (float and int).

The integers in these measurements are arbitrary-precision integers (BigInteger, or a Python type based on it) for comparability.

6.2.1. VSJ 2 evo 4

We are measuring the execution time of Number.add. We measure Number.multiply to confirm its overhead is similar. Again we compare the API call with an equivalent calculation in pure Java, where the type of the operand is known statically.

Benchmark                           Mode  Cnt    Score   Error  Units
PyFloatBinary.add_float_float       avgt   20   37.583 ± 0.436  ns/op
PyFloatBinary.add_float_float_java  avgt  200    5.884 ± 0.034  ns/op
PyFloatBinary.add_float_int         avgt   20   49.790 ± 0.240  ns/op
PyFloatBinary.add_float_int_java    avgt  200   12.914 ± 0.102  ns/op
PyFloatBinary.add_int_float         avgt   20   75.128 ± 0.429  ns/op
PyFloatBinary.add_int_float_java    avgt  200   13.183 ± 0.049  ns/op
PyFloatBinary.mul_float_float       avgt   20   37.640 ± 0.193  ns/op
PyFloatBinary.mul_float_float_java  avgt  200    5.835 ± 0.029  ns/op
PyFloatBinary.nothing               avgt  200    5.518 ± 0.061  ns/op
PyFloatBinary.quartic               avgt   20  169.479 ± 0.608  ns/op
PyFloatBinary.quartic_java          avgt  200    6.417 ± 0.103  ns/op
PyLongBinary.add                    avgt   20   59.843 ± 1.894  ns/op
PyLongBinary.add_java               avgt   20   31.587 ± 1.567  ns/op
PyLongBinary.addbig                 avgt   20   71.830 ± 0.624  ns/op
PyLongBinary.addbig_java            avgt   20   39.237 ± 0.899  ns/op
PyLongBinary.mul                    avgt   20   75.785 ± 0.948  ns/op
PyLongBinary.mul_java               avgt   20   43.508 ± 0.213  ns/op
PyLongBinary.mulbig                 avgt   20   93.269 ± 6.621  ns/op
PyLongBinary.mulbig_java            avgt   20   56.321 ± 0.336  ns/op

The invocation overhead is 28–32 ns on this machine when the types are alike (int with int, float with float), whether added or multiplied. float + int is slower (an overhead of about 37 ns), perhaps because the int must be converted to a float, and int + float slower still (about 62 ns), because int is consulted first, only to return NotImplemented.

The test fixture that produces this table consists of methods that typically look like this (without the JMH annotations):

public class PyFloatBinary {

    BigInteger iv = BigInteger.valueOf(6), iw = BigInteger.valueOf(7);
    double v = 1.01 * iv.doubleValue(), w = 1.01 * iw.doubleValue();
    PyObject fvo = Py.val(v), fwo = Py.val(w);
    PyObject ivo = Py.val(iv), iwo = Py.val(iw);

    public double nothing() { return v; }
    public double add_float_int_java() { return v + iw.doubleValue(); }

    public PyObject add_float_int() throws Throwable {
        return Number.add(fvo, iwo);
    }
    // ...

    public double quartic_java() { return v * w * (v + w) * (v - w); }

    public PyObject quartic() throws Throwable {
        return Number.multiply(
                Number.multiply(Number.multiply(fvo, fwo),
                        Number.add(fvo, fwo)),
                Number.subtract(fvo, fwo));
    }
    // ...
}

We throw in a quartic calculation to examine how the time scales with complexity when no new data need be fetched. This method calls for five floating-point operations.

6.2.2. Jython 2.7.2

We measure PyObject._add and PyObject._mul as these are comparable in purpose to the VSJ 2 abstract API, and are called from compiled Python. The integers in this measurement are Python 2 long for comparability with Python 3 int.

Benchmark                           Mode  Cnt   Score   Error  Units
PyFloatBinary.add_float_float       avgt   20  17.086 ± 0.396  ns/op
PyFloatBinary.add_float_float_java  avgt  200   5.719 ± 0.035  ns/op
PyFloatBinary.add_float_int         avgt   20  36.094 ± 0.227  ns/op
PyFloatBinary.add_float_int_java    avgt  200  12.554 ± 0.087  ns/op
PyFloatBinary.add_int_float         avgt   20  34.508 ± 0.449  ns/op
PyFloatBinary.add_int_float_java    avgt  200  13.114 ± 0.056  ns/op
PyFloatBinary.mul_float_float       avgt   20  16.731 ± 0.069  ns/op
PyFloatBinary.mul_float_float_java  avgt  200   5.930 ± 0.050  ns/op
PyFloatBinary.nothing               avgt  200   5.530 ± 0.050  ns/op
PyFloatBinary.quartic               avgt   20  18.168 ± 0.116  ns/op
PyFloatBinary.quartic_java          avgt  200   6.550 ± 0.065  ns/op
PyLongBinary.add                    avgt   20  37.721 ± 0.496  ns/op
PyLongBinary.add_java               avgt   20  29.790 ± 1.104  ns/op
PyLongBinary.addbig                 avgt   20  49.441 ± 0.457  ns/op
PyLongBinary.addbig_java            avgt   20  38.964 ± 0.198  ns/op
PyLongBinary.mul                    avgt   20  53.263 ± 0.202  ns/op
PyLongBinary.mul_java               avgt   20  43.607 ± 0.170  ns/op
PyLongBinary.mulbig                 avgt   20  68.540 ± 1.227  ns/op
PyLongBinary.mulbig_java            avgt   20  55.820 ± 0.165  ns/op
PyLongBinary.nothing                avgt   20   5.522 ± 0.119  ns/op

The invocation overhead is 10–13 ns on this machine when the types are alike (int with int, float with float), whether added or multiplied. For float + int and int + float the overhead is roughly double (21–24 ns), perhaps because the int must be converted to a float. The need to consult int first in int + float does not seem to add any overhead.

6.2.3. VSJ 2 evo 4 with invokedynamic

Our benchmarks depend on specially-generated equivalents to Number.add, Number.multiply and (in one place) Number.subtract that contain just an invokedynamic instruction.

Benchmark                           Mode  Cnt   Score   Error  Units
PyFloatBinary.add_float_float       avgt   20  17.682 ± 0.371  ns/op
PyFloatBinary.add_float_float_java  avgt   20   5.686 ± 0.057  ns/op
PyFloatBinary.add_float_int         avgt   20  22.065 ± 0.182  ns/op
PyFloatBinary.add_float_int_java    avgt   20  12.675 ± 0.081  ns/op
PyFloatBinary.add_int_float         avgt   20  24.399 ± 0.743  ns/op
PyFloatBinary.add_int_float_java    avgt   20  13.061 ± 0.126  ns/op
PyFloatBinary.mul_float_float       avgt   20  16.066 ± 0.509  ns/op
PyFloatBinary.mul_float_float_java  avgt   20   5.688 ± 0.073  ns/op
PyFloatBinary.nothing               avgt   20   5.597 ± 0.126  ns/op
PyFloatBinary.quartic               avgt   20  49.196 ± 0.211  ns/op
PyFloatBinary.quartic_java          avgt   20   6.619 ± 0.609  ns/op
PyLongBinary.add                    avgt   20  45.005 ± 1.198  ns/op
PyLongBinary.add_java               avgt   20  29.224 ± 1.099  ns/op
PyLongBinary.addbig                 avgt   20  54.124 ± 0.392  ns/op
PyLongBinary.addbig_java            avgt   20  38.736 ± 0.238  ns/op
PyLongBinary.mul                    avgt   20  56.878 ± 0.325  ns/op
PyLongBinary.mul_java               avgt   20  44.448 ± 0.807  ns/op
PyLongBinary.mulbig                 avgt   20  65.876 ± 0.267  ns/op
PyLongBinary.mulbig_java            avgt   20  54.934 ± 0.191  ns/op

The invocation overhead is 10–15 ns on this machine. As before, we see some additional cost to convert types during float + int and int + float.

As in the unary case, the call sites become specialised to invoke op_add, op_mul or op_sub from the type (or types) encountered. The call site is quite complicated compared to the unary case because the types of both arguments must be taken into account in the specialisation.
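
A call site of this shape can be sketched directly with java.lang.invoke. The following is an illustration only, not the VSJ 2 implementation: it shows a single specialisation for the (Double, Double) class pair, guarded on the classes of both arguments, with a fall-back that a real site would use to re-link itself. All names here are invented.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

/** Minimal sketch of a binary call site guarded on both operand classes. */
final class BinarySite {

    static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();
    static final MethodType OO =
            MethodType.methodType(Object.class, Object.class, Object.class);

    /** Specialisation for the (Double, Double) class pair. */
    static double addDD(Double v, Double w) { return v + w; }

    /** True iff both arguments have the classes seen when linking. */
    static boolean classGuard(Class<?> cv, Class<?> cw, Object v, Object w) {
        return cv.isInstance(v) && cw.isInstance(w);
    }

    /** Stand-in for the fall-back that would re-link a real site. */
    static Object fallback(Object v, Object w) {
        throw new UnsupportedOperationException("re-link site");
    }

    static MethodHandle makeSite() throws ReflectiveOperationException {
        MethodHandle target = LOOKUP.findStatic(BinarySite.class, "addDD",
                MethodType.methodType(double.class, Double.class, Double.class))
                .asType(OO);    // box and generalise to (Object,Object)Object
        MethodHandle guard = LOOKUP.findStatic(BinarySite.class, "classGuard",
                MethodType.methodType(boolean.class, Class.class, Class.class,
                        Object.class, Object.class));
        // Bind the two classes observed at link time into the guard.
        guard = MethodHandles.insertArguments(guard, 0,
                Double.class, Double.class);
        MethodHandle fb = LOOKUP.findStatic(BinarySite.class, "fallback", OO);
        return MethodHandles.guardWithTest(guard, target, fb);
    }

    static final MethodHandle SITE;
    static {
        try {
            SITE = makeSite();
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Convenience wrapper so callers need not handle Throwable. */
    static Object call(Object v, Object w) {
        try {
            return SITE.invokeExact(v, w);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```

Invoking BinarySite.call(1.5, 2.5) takes the guarded fast path; any other class pair reaches the fall-back. A real site must additionally chain specialisations and deal with NotImplemented.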

6.2.4. VSJ 3 evo 1

VSJ 3 is the “plain Java object” implementation, where finding the handle of an operation involves work (a call to ClassValue.get()) not necessary with the type-labelled PyObjects of VSJ 2.
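
The lookup itself can be sketched with ClassValue. This is an illustration only, not the VSJ 3 code: TypeInfo and the compute logic are invented, but the access pattern, a ClassValue.get() keyed on the class of each operand, is the cost under discussion.

```java
/**
 * Sketch of mapping an arbitrary Java class to Python type information
 * through a ClassValue, as a "plain object" implementation must do on
 * every abstract-API call. TypeInfo is invented for illustration.
 */
final class TypeRegistry {

    record TypeInfo(String name) {}

    /** Cached per class: computeValue runs once per Java class. */
    static final ClassValue<TypeInfo> TYPES = new ClassValue<>() {
        @Override
        protected TypeInfo computeValue(Class<?> c) {
            // A real registry would locate the Python type adopting c.
            return new TypeInfo(c.getSimpleName());
        }
    };

    static TypeInfo of(Object o) { return TYPES.get(o.getClass()); }
}
```

After the first call for a given class, ClassValue.get() is a cache hit, but it is still work that the type-labelled PyObjects of VSJ 2 avoid entirely.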

Benchmark                            Mode  Cnt    Score   Error  Units
PyFloatBinary.add_float_float        avgt   20   54.345 ± 0.835  ns/op
PyFloatBinary.add_float_float_java   avgt  200    5.993 ± 0.050  ns/op
PyFloatBinary.add_float_int          avgt   20   68.574 ± 1.226  ns/op
PyFloatBinary.add_float_int_java     avgt  200    6.851 ± 0.217  ns/op
PyFloatBinary.add_int_float          avgt   20   94.763 ± 2.103  ns/op
PyFloatBinary.add_int_float_java     avgt  200    6.299 ± 0.062  ns/op
PyFloatBinary.addbig_float_int       avgt   20   76.177 ± 0.308  ns/op
PyFloatBinary.addbig_float_int_java  avgt   20   17.232 ± 0.136  ns/op
PyFloatBinary.addbig_int_float       avgt   20   98.819 ± 0.304  ns/op
PyFloatBinary.addbig_int_float_java  avgt   20   20.261 ± 0.094  ns/op
PyFloatBinary.mul_float_float        avgt   20   53.369 ± 0.275  ns/op
PyFloatBinary.mul_float_float_java   avgt  200    5.995 ± 0.048  ns/op
PyFloatBinary.nothing                avgt  200    5.618 ± 0.037  ns/op
PyFloatBinary.quartic                avgt   20  257.507 ± 1.100  ns/op
PyFloatBinary.quartic_java           avgt  200    6.934 ± 0.132  ns/op
PyLongBinary.add                     avgt   20   60.712 ± 0.622  ns/op
PyLongBinary.add_java                avgt   20   30.235 ± 0.968  ns/op
PyLongBinary.addbig                  avgt   20   85.987 ± 0.794  ns/op
PyLongBinary.addbig_java             avgt   20   38.598 ± 0.914  ns/op
PyLongBinary.mul                     avgt   20   78.763 ± 1.996  ns/op
PyLongBinary.mul_java                avgt   20   43.609 ± 0.212  ns/op
PyLongBinary.mulbig                  avgt   20   96.414 ± 0.915  ns/op
PyLongBinary.mulbig_java             avgt   20   55.932 ± 0.576  ns/op

Compared with VSJ 2 evo 4, for float the overhead has indeed increased, to 50–90 ns (up from around 30 ns), but in fact we are doing about the same as VSJ 2 with int.

Again the comparison with VSJ 2 is not quite direct, since VSJ 3 represents an int by Integer if the value is not too big. The ints are Integer in add_float_int and add_int_float, and primitive Java int in their Java counterparts. The tests with addbig or mulbig in the name use BigInteger.
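
The representation choice can be sketched as follows. The helper name and the threshold test are illustrative, not the VSJ 3 API:

```java
import java.math.BigInteger;

/** Illustrative choice of compact representation for a Python int. */
final class IntRepr {
    static Object fromBig(BigInteger b) {
        // bitLength() excludes the sign bit, so 31 bits or fewer fit an int
        return b.bitLength() <= 31 ? (Object) b.intValueExact() : b;
    }
}
```

A small result comes back as an Integer, which the specialised call sites in the next section can guard on directly; only values that overflow int stay as BigInteger.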

6.2.5. VSJ 3 evo 1 with invokedynamic

When VSJ 3 binds the MethodHandles into binary invokedynamic call sites, it uses definitions specialised to the types of both operands.

For example, the call site in PyFloatBinary.add_int_float will be bound to a method with signature Object __add__(Integer, Double), provided by the implementation of float. int provides no method with this signature, and since that can be decided at the time the call site is bound, it is unnecessary to consult int during the call itself. (This only applies to types where the definition cannot change.)

Benchmark                            Mode  Cnt   Score   Error  Units
PyFloatBinary.add_float_float        avgt   20  12.522 ± 0.112  ns/op
PyFloatBinary.add_float_float_java   avgt   20   6.026 ± 0.148  ns/op
PyFloatBinary.add_float_int          avgt   20  16.439 ± 0.144  ns/op
PyFloatBinary.add_float_int_java     avgt   20   6.404 ± 0.280  ns/op
PyFloatBinary.add_int_float          avgt   20  15.417 ± 0.047  ns/op
PyFloatBinary.add_int_float_java     avgt   20   6.563 ± 0.285  ns/op
PyFloatBinary.addbig_float_int       avgt   20  24.067 ± 0.245  ns/op
PyFloatBinary.addbig_float_int_java  avgt   20  16.972 ± 0.072  ns/op
PyFloatBinary.addbig_int_float       avgt   20  23.798 ± 0.128  ns/op
PyFloatBinary.addbig_int_float_java  avgt   20  20.342 ± 0.108  ns/op
PyFloatBinary.mul_float_float        avgt   20  12.604 ± 0.062  ns/op
PyFloatBinary.mul_float_float_java   avgt   20   6.106 ± 0.268  ns/op
PyFloatBinary.nothing                avgt   20   5.741 ± 0.080  ns/op
PyFloatBinary.quartic                avgt   20  12.925 ± 0.123  ns/op
PyFloatBinary.quartic_java           avgt   20   6.746 ± 0.585  ns/op
PyLongBinary.add                     avgt   20  15.204 ± 0.172  ns/op
PyLongBinary.add_java                avgt   20   5.181 ± 0.173  ns/op
PyLongBinary.addbig                  avgt   20  43.007 ± 0.388  ns/op
PyLongBinary.addbig_java             avgt   20  38.462 ± 0.245  ns/op
PyLongBinary.mul                     avgt   20  15.770 ± 0.099  ns/op
PyLongBinary.mul_java                avgt   20   6.080 ± 0.190  ns/op
PyLongBinary.mulbig                  avgt   20  61.291 ± 0.453  ns/op
PyLongBinary.mulbig_java             avgt   20  56.014 ± 0.636  ns/op

6.2.6. Analysis

Binary Slot Dispatch

The semantics of binary operations in Python are complex, in particular the way in which the types of both arguments must be consulted, and types give way to sub-types. This is why we explore the combinations float+float, float+int and int+float separately. The last of these has the largest overhead, since int.__add__ must be consulted, and return NotImplemented, before float.__radd__ comes up with the answer.

Even in the case float+int, where float.__add__ will succeed, code that is ignorant of the particular types must check that int is not a sub-type of float, then test the result for NotImplemented. This happens for every addition, compared with Java, in which the types are statically known and only a handful of instructions need be executed.
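
The protocol can be sketched in plain Java. The sketch below is illustrative only (Num, Int and Flt are invented stand-ins, not VSJ types) and omits Python's rule that a right operand of a sub-type is consulted first:

```java
/** Minimal sketch of Python's binary dispatch protocol. */
final class BinaryDispatch {

    /** Sentinel playing the role of Python's NotImplemented. */
    static final Object NOT_IMPLEMENTED = new Object();

    interface Num {
        default Object add(Object other) { return NOT_IMPLEMENTED; }   // __add__
        default Object radd(Object other) { return NOT_IMPLEMENTED; }  // __radd__
    }

    /** An int that only knows how to add other ints. */
    record Int(long v) implements Num {
        public Object add(Object other) {
            return other instanceof Int i ? new Int(v + i.v) : NOT_IMPLEMENTED;
        }
        public Object radd(Object other) { return add(other); }
    }

    /** A float that accepts floats and ints on either side. */
    record Flt(double v) implements Num {
        public Object add(Object other) { return addImpl(other); }
        public Object radd(Object other) { return addImpl(other); }
        private Object addImpl(Object other) {
            if (other instanceof Flt f) return new Flt(v + f.v());
            if (other instanceof Int i) return new Flt(v + i.v());
            return NOT_IMPLEMENTED;
        }
    }

    /** Analogue of the abstract API: try left __add__, then right __radd__. */
    static Object add(Num v, Num w) {
        Object r = v.add(w);            // left operand gets first chance
        if (r == NOT_IMPLEMENTED)
            r = w.radd(v);              // right operand replies instead
        if (r == NOT_IMPLEMENTED)
            throw new UnsupportedOperationException("unsupported operand types");
        return r;
    }
}
```

Here add(Int, Flt) pays for two dispatches, because Int.add first returns the sentinel, while add(Flt, Int) succeeds on the first, mirroring the asymmetry between int + float and float + int in the tables above.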

Greater Strain on In-lining

In the binary performance tests, we again see that Jython 2 is faster than VSJ 2 and VSJ 3 using invokeExact, supporting the hypothesis that the virtual method calls are more successfully in-lined. (See The invokeExact Barrier.)

In the Jython 2 float tests, the difference made by having int on the left, returning NotImplemented each time, is not pronounced. The assembly code generated by the JVM is long and complex, but it appears that, having in-lined the body of PyLong.__add__, the compiler can see that NotImplemented is the inevitable result, and goes directly to PyFloat.__radd__.

In VSJ 2, we see quite a big penalty for having int on the left. This deficit is only partly made up in VSJ 2 with invokedynamic. We speculate that the method handles are only partially in-lined because they are too deeply nested for the JVM to do so fully. The wrappers that test for NotImplemented contribute to that depth, in addition to the class guards.

The shortcoming we noted in VSJ 2 unary invokedynamic call sites, that they assume no re-definition of the operations may occur, is (doubly) present in the binary case, but we correct that in VSJ 3.

Dispatch Specific to Java Class

We saw a small advantage in VSJ 3 from Dispatch Specific to Java Class in the unary case. In the binary case, we gain enormously from specialisation in a mixed case like int+float. After a guard on class, the course of events may be plotted completely by the call site: it is not necessary (in the case cited) to consult int, or to test for NotImplemented in the handle.

To pull this off, we have to supply quite a number of static implementation methods. Each implementation class of float must be supported as the first argument of __add__ and __radd__, then combined with each legitimate other operand class for the second.

class PyFloatBinops { // ...
    static Object __add__(PyFloat v, PyFloat w) {
        return v.value + w.value;
    }
    static Object __add__(PyFloat v, Double w) {
        return v.value + w.doubleValue();
    }
    static Object __add__(PyFloat v, Integer w) {
        return v.value + w.doubleValue();
    }
    static Object __add__(PyFloat v, BigInteger w) {
        return v.value + PyLong.convertToDouble(w);
    }
    // ...
    static Object __add__(Double v, PyFloat w) {
        return v.doubleValue() + w.value;
    }
    // ...
    static Object __radd__(PyFloat w, PyFloat v) {
        return v.value + w.value;
    }
    // ...
}

There are many combinations and we use a script to generate them. The call site binds in the method for each operand class pair, under a double class guard.

Thoughts on the quartic test

The test quartic provides a surprise or two. This method asks for five floating-point operations (three multiplications, an addition and a subtraction). We have seen that the time for each is not greatly different in isolation.

The pure Java version quartic_java is noteworthy for taking barely a nanosecond longer than a single addition. This is discussed in Vanishing Time in Primitive Operations. The pipelining and concurrency evident in the result is possible when the floating-point operations are adjacent (part of the same expression, say).

Jython 2 also achieves a time for quartic roughly the same as its own add_float_float, suggesting the residual overhead (probably two type checks) is paid only once and the in-lined code optimised as well as for the native case.

In VSJ 2, the overhead of the quartic is basically 5 times that of add_float_float, showing that there is no in-lining of the separate calls that would bring the floating point calculation together in one place.

This is also approximately true of VSJ 2 with invokedynamic: the overhead relative to pure Java is 43 ns, around 3.5 times the 12 ns of add_float_float. Evidently we do not get the remarkable concurrency seen in quartic for Jython 2 and pure Java.

Finally, in VSJ 3, the specialisation to class allows the handles to in-line fully, and we are down to about 7 ns total overhead relative to the pure Java expression.