6.2. Binary Operations¶
Binary operations are interesting in Python because of the way left and right operands both get a chance to reply with the answer. This complexity is the price we pay for dynamic typing.
As the overhead logic does not vary with the operation concerned,
there would be little point in measuring exhaustively,
all the different arithmetic operations.
Instead, we focus on a single operation (addition)
and explore the effect of varying the types of the operands
(float
and int
).
The integers in these measurements are arbitrary precision integers
(BigInteger
or a Python type based on it)
for comparability.
6.2.1. VSJ 2 evo 4¶
We are measuring the execution time of Number.add
.
We measure Number.multiply
to confirm its overhead is similar.
Again we compare the API call with an equivalent calculation in pure Java,
where the type of the operand is known statically.
Benchmark Mode Cnt Score Error Units
PyFloatBinary.add_float_float avgt 20 37.583 ± 0.436 ns/op
PyFloatBinary.add_float_float_java avgt 200 5.884 ± 0.034 ns/op
PyFloatBinary.add_float_int avgt 20 49.790 ± 0.240 ns/op
PyFloatBinary.add_float_int_java avgt 200 12.914 ± 0.102 ns/op
PyFloatBinary.add_int_float avgt 20 75.128 ± 0.429 ns/op
PyFloatBinary.add_int_float_java avgt 200 13.183 ± 0.049 ns/op
PyFloatBinary.mul_float_float avgt 20 37.640 ± 0.193 ns/op
PyFloatBinary.mul_float_float_java avgt 200 5.835 ± 0.029 ns/op
PyFloatBinary.nothing avgt 200 5.518 ± 0.061 ns/op
PyFloatBinary.quartic avgt 20 169.479 ± 0.608 ns/op
PyFloatBinary.quartic_java avgt 200 6.417 ± 0.103 ns/op
PyLongBinary.add avgt 20 59.843 ± 1.894 ns/op
PyLongBinary.add_java avgt 20 31.587 ± 1.567 ns/op
PyLongBinary.addbig avgt 20 71.830 ± 0.624 ns/op
PyLongBinary.addbig_java avgt 20 39.237 ± 0.899 ns/op
PyLongBinary.mul avgt 20 75.785 ± 0.948 ns/op
PyLongBinary.mul_java avgt 20 43.508 ± 0.213 ns/op
PyLongBinary.mulbig avgt 20 93.269 ± 6.621 ns/op
PyLongBinary.mulbig_java avgt 20 56.321 ± 0.336 ns/op
The invocation overhead is 28-32ns on this machine for similar types,
added or multiplied (int
with int
, float
with float
).
float + int
is slower (at 37ns) perhaps related to the need to convert
the int
to a float
,
and int + float
even slower (at 62ns) because int
is consulted first,
only to return NotImplemented
.
The test fixture that produces this table consists of methods that typically look like this (without the JMH annotations):
public class PyFloatBinary {
BigInteger iv = BigInteger.valueOf(6), iw = BigInteger.valueOf(7);
double v = 1.01 * iv.doubleValue(), w = 1.01 * iw.doubleValue();
PyObject fvo = Py.val(v), fwo = Py.val(w);
PyObject ivo = Py.val(iv), iwo = Py.val(iv);
public double nothing() { return v; }
public double add_float_int_java() { return v + iw.doubleValue(); }
public PyObject add_float_int() throws Throwable {
return Number.add(fvo, iwo);
}
// ...
public double quartic_java() { return v * w * (v + w) * (v - w); }
public PyObject quartic() throws Throwable {
return Number.multiply(
Number.multiply(Number.multiply(fvo, fwo),
Number.add(fvo, fwo)),
Number.subtract(fvo, fwo));
}
// ...
}
We throw in a quartic
calculation to examine how the time scales
with complexity but no new data fetches.
This method calls for 5 floating operations.
6.2.2. Jython 2.7.2¶
We measure PyObject._add
and PyObject._mul
as these are comparable in purpose to the VSJ 2 abstract API,
and are called from compiled Python.
The integers in this measurement are Python 2 long
for comparability
with Python 3 int
.
Benchmark Mode Cnt Score Error Units
PyFloatBinary.add_float_float avgt 20 17.086 ± 0.396 ns/op
PyFloatBinary.add_float_float_java avgt 200 5.719 ± 0.035 ns/op
PyFloatBinary.add_float_int avgt 20 36.094 ± 0.227 ns/op
PyFloatBinary.add_float_int_java avgt 200 12.554 ± 0.087 ns/op
PyFloatBinary.add_int_float avgt 20 34.508 ± 0.449 ns/op
PyFloatBinary.add_int_float_java avgt 200 13.114 ± 0.056 ns/op
PyFloatBinary.mul_float_float avgt 20 16.731 ± 0.069 ns/op
PyFloatBinary.mul_float_float_java avgt 200 5.930 ± 0.050 ns/op
PyFloatBinary.nothing avgt 200 5.530 ± 0.050 ns/op
PyFloatBinary.quartic avgt 20 18.168 ± 0.116 ns/op
PyFloatBinary.quartic_java avgt 200 6.550 ± 0.065 ns/op
PyLongBinary.add avgt 20 37.721 ± 0.496 ns/op
PyLongBinary.add_java avgt 20 29.790 ± 1.104 ns/op
PyLongBinary.addbig avgt 20 49.441 ± 0.457 ns/op
PyLongBinary.addbig_java avgt 20 38.964 ± 0.198 ns/op
PyLongBinary.mul avgt 20 53.263 ± 0.202 ns/op
PyLongBinary.mul_java avgt 20 43.607 ± 0.170 ns/op
PyLongBinary.mulbig avgt 20 68.540 ± 1.227 ns/op
PyLongBinary.mulbig_java avgt 20 55.820 ± 0.165 ns/op
PyLongBinary.nothing avgt 20 5.522 ± 0.119 ns/op
The invocation overhead is 10-13ns on this machine for similar types,
added or multiplied (int
with int
, float
with float
).
For float + int
and int + float
the overhead is roughly double (at 24ns)
perhaps related to the need to convert the int
to a float
,
The need to consult int
first in int + float
does not seem to have added any overhead.
6.2.3. VSJ 2 evo 4 with invokedynamic
¶
Our benchmarks depend on specially-generated equivalents to Number.add
,
Number.multiply
and (in one place) Number.subtract
that contain just an invokedynamic
instruction.
Benchmark Mode Cnt Score Error Units
PyFloatBinary.add_float_float avgt 20 17.682 ± 0.371 ns/op
PyFloatBinary.add_float_float_java avgt 20 5.686 ± 0.057 ns/op
PyFloatBinary.add_float_int avgt 20 22.065 ± 0.182 ns/op
PyFloatBinary.add_float_int_java avgt 20 12.675 ± 0.081 ns/op
PyFloatBinary.add_int_float avgt 20 24.399 ± 0.743 ns/op
PyFloatBinary.add_int_float_java avgt 20 13.061 ± 0.126 ns/op
PyFloatBinary.mul_float_float avgt 20 16.066 ± 0.509 ns/op
PyFloatBinary.mul_float_float_java avgt 20 5.688 ± 0.073 ns/op
PyFloatBinary.nothing avgt 20 5.597 ± 0.126 ns/op
PyFloatBinary.quartic avgt 20 49.196 ± 0.211 ns/op
PyFloatBinary.quartic_java avgt 20 6.619 ± 0.609 ns/op
PyLongBinary.add avgt 20 45.005 ± 1.198 ns/op
PyLongBinary.add_java avgt 20 29.224 ± 1.099 ns/op
PyLongBinary.addbig avgt 20 54.124 ± 0.392 ns/op
PyLongBinary.addbig_java avgt 20 38.736 ± 0.238 ns/op
PyLongBinary.mul avgt 20 56.878 ± 0.325 ns/op
PyLongBinary.mul_java avgt 20 44.448 ± 0.807 ns/op
PyLongBinary.mulbig avgt 20 65.876 ± 0.267 ns/op
PyLongBinary.mulbig_java avgt 20 54.934 ± 0.191 ns/op
The invocation overhead is 10-15ns on this machine.
As before we see some additional cost to convert types during
float + int
and int + float
.
As in the unary case,
the call sites become specialised to invoke op_add
,
op_mul
or op_sub
from the type (or types) encountered.
The call site is quite complicated compared to the unary case
because the types of both arguments must be taken into account
in the specialisation.
6.2.4. VSJ 3 evo 1¶
VSJ 3 is the “plain Java object” implementation, where
finding the handle of an operation involves work
(a call to ClassValue.get()
)
not necessary with the type-labelled PyObject
s of VSJ 2.
27/02 18:30
Benchmark Mode Cnt Score Error Units
PyFloatBinary.add_float_float avgt 20 54.345 ± 0.835 ns/op
PyFloatBinary.add_float_float_java avgt 200 5.993 ± 0.050 ns/op
PyFloatBinary.add_float_int avgt 20 68.574 ± 1.226 ns/op
PyFloatBinary.add_float_int_java avgt 200 6.851 ± 0.217 ns/op
PyFloatBinary.add_int_float avgt 20 94.763 ± 2.103 ns/op
PyFloatBinary.add_int_float_java avgt 200 6.299 ± 0.062 ns/op
PyFloatBinary.addbig_float_int avgt 20 76.177 ± 0.308 ns/op
PyFloatBinary.addbig_float_int_java avgt 20 17.232 ± 0.136 ns/op
PyFloatBinary.addbig_int_float avgt 20 98.819 ± 0.304 ns/op
PyFloatBinary.addbig_int_float_java avgt 20 20.261 ± 0.094 ns/op
PyFloatBinary.mul_float_float avgt 20 53.369 ± 0.275 ns/op
PyFloatBinary.mul_float_float_java avgt 200 5.995 ± 0.048 ns/op
PyFloatBinary.nothing avgt 200 5.618 ± 0.037 ns/op
PyFloatBinary.quartic avgt 20 257.507 ± 1.100 ns/op
PyFloatBinary.quartic_java avgt 200 6.934 ± 0.132 ns/op
PyLongBinary.add avgt 20 60.712 ± 0.622 ns/op
PyLongBinary.add_java avgt 20 30.235 ± 0.968 ns/op
PyLongBinary.addbig avgt 20 85.987 ± 0.794 ns/op
PyLongBinary.addbig_java avgt 20 38.598 ± 0.914 ns/op
PyLongBinary.mul avgt 20 78.763 ± 1.996 ns/op
PyLongBinary.mul_java avgt 20 43.609 ± 0.212 ns/op
PyLongBinary.mulbig avgt 20 96.414 ± 0.915 ns/op
PyLongBinary.mulbig_java avgt 20 55.932 ± 0.576 ns/op
Compared with VSJ 2 evo4,
for float
the overhead has indeed increased to 50-90ns
(up from around 30ns),
but in fact we are doing about the same as VSJ 2 with int
.
Again the comparison with VSJ 2 is not quite direct,
since in VSJ 3 we represent int
by Integer
,
if the value is not too big.
The int
s are Integer
in add_float_int
and add_int_float
,
and primitive Java int in their Java counterparts.
The tests with addbig
or mulbig
in the name use BigInteger
.
6.2.5. VSJ 3 evo 1 with invokedynamic
¶
When VSJ 3 binds the MethodHandle
s
into binary invokedynamic
call sites,
it uses definitions specialised to the types of both operands.
For example, the call site in PyFloatBinary.add_int_float
will be bound to a method with signature Object __add__(Integer, Double)
,
provided by the implementation of float
.
The fact that int
provides no method with this signature,
and this can be decided at the time the call site is being bound,
makes it unnecessary to consult int
during the call itself.
(This only applies to types where the definition cannot change.)
Benchmark Mode Cnt Score Error Units
PyFloatBinary.add_float_float avgt 20 12.522 ± 0.112 ns/op
PyFloatBinary.add_float_float_java avgt 20 6.026 ± 0.148 ns/op
PyFloatBinary.add_float_int avgt 20 16.439 ± 0.144 ns/op
PyFloatBinary.add_float_int_java avgt 20 6.404 ± 0.280 ns/op
PyFloatBinary.add_int_float avgt 20 15.417 ± 0.047 ns/op
PyFloatBinary.add_int_float_java avgt 20 6.563 ± 0.285 ns/op
PyFloatBinary.addbig_float_int avgt 20 24.067 ± 0.245 ns/op
PyFloatBinary.addbig_float_int_java avgt 20 16.972 ± 0.072 ns/op
PyFloatBinary.addbig_int_float avgt 20 23.798 ± 0.128 ns/op
PyFloatBinary.addbig_int_float_java avgt 20 20.342 ± 0.108 ns/op
PyFloatBinary.mul_float_float avgt 20 12.604 ± 0.062 ns/op
PyFloatBinary.mul_float_float_java avgt 20 6.106 ± 0.268 ns/op
PyFloatBinary.nothing avgt 20 5.741 ± 0.080 ns/op
PyFloatBinary.quartic avgt 20 12.925 ± 0.123 ns/op
PyFloatBinary.quartic_java avgt 20 6.746 ± 0.585 ns/op
PyLongBinary.add avgt 20 15.204 ± 0.172 ns/op
PyLongBinary.add_java avgt 20 5.181 ± 0.173 ns/op
PyLongBinary.addbig avgt 20 43.007 ± 0.388 ns/op
PyLongBinary.addbig_java avgt 20 38.462 ± 0.245 ns/op
PyLongBinary.mul avgt 20 15.770 ± 0.099 ns/op
PyLongBinary.mul_java avgt 20 6.080 ± 0.190 ns/op
PyLongBinary.mulbig avgt 20 61.291 ± 0.453 ns/op
PyLongBinary.mulbig_java avgt 20 56.014 ± 0.636 ns/op
6.2.6. Analysis¶
Binary Slot Dispatch¶
The semantics of binary operations in Python are complex,
in particular the way in which the types of both arguments
must be consulted,
and types give way to sub-types.
This is the reason why we explore the combinations
float+float
, float+int
and int+float
separately.
The last of these has the largest overhead,
since the int.__add__
must be consulted and return NotImplemented
,
before float.__add__
comes up with the answer.
Even in the case float+int
where float.__add__
will succeed,
code that is ignorant of the particular types
must check that int
is not a sub-type of it,
then apply a test for NotImplemented
.
This happens for every addition,
compared with Java in which the types are statically known,
and only a handful of instructions need be executed.
Greater Strain on In-lining¶
In the binary performance tests,
we again see that Jython 2 is faster than VSJ 2
and VSJ 3 using invokeExact
,
supporting the hypothesis that the virtual method calls
are more successfully in-lined.
(See The invokeExact Barrier.)
In Jython 2 float
tests,
the difference made by having int
on the left,
and returning NotImplemented
each time is not pronounced.
The assembly code generated by the JVM is long and complex,
but it appears that having in-lined the body of PyLong.__add__
,
the compiler can see that NotImplemented
is the inevitable result,
and goes directly to PyFloat.__radd__
.
In VSJ 2, we see quite a big penalty for having int
on the left.
This deficit is only partly made up in VSJ 2 with invokedynamic
.
We speculate that the method handles are only partially in-lined
because they are too deeply nested for the JVM to do so fully.
The wrappers that test for NotImplemented
contribute to that depth,
in addition to the class guards.
The shortcoming we noted in VSJ 2 unary invokedynamic
call sites,
that they assume no re-definition of the operations may occur,
is (doubly) present in the binary case,
but we correct that in VSJ 3.
Dispatch Specific to Java Class¶
We saw a small advantage in VSJ 3 from
Dispatch Specific to Java Class in the unary case.
In the binary case,
we gain enormously from specialisation in a mixed case like int+float
.
After a guard on class,
the course of events may be plotted completely by the call site:
it is not necessary (in the case cited) to consult int
,
or to test for NotImplemented
in the handle.
To pull this off,
we have to supply quite a number of static
implementation methods.
Each implementation class of float
must be supported
as the first argument of __add__
and __radd__
,
then combined with each legitimate other operand class for the second.
class PyFloatBinops { // ...
static Object __add__(PyFloat v, PyFloat w) {
return v.value + w.value;
}
static Object __add__(PyFloat v, Double w) {
return v.value + w.doubleValue();
}
static Object __add__(PyFloat v, Integer w) {
return v.value + w.doubleValue();
}
static Object __add__(PyFloat v, BigInteger w) {
return v.value + PyLong.convertToDouble(w);
}
// ...
static Object __add__(Double v, PyFloat w) {
return v.doubleValue() + w.value;
}
// ...
static Object __radd__(PyFloat w, PyFloat v) {
return v.value + w.value;
}
// ...
There are many combinations and we use a script to generate them. The call site binds in the method for each operand class pair, under a double class guard.
Thoughts on the quartic test¶
The test quartic
provides a surprise or two.
This method asks for 5 floating operations
(3 multiplications, an addition and a subtraction).
We have seen that the time for each is not greatly different in isolation.
The pure Java version quartic_java
is noteworthy
for taking barely a nanosecond longer than a single addition.
This is discussed in Vanishing Time in Primitive Operations.
The pipelining and concurrency evident in the result
is possible when the floating-point operations are adjacent
(part of the same expression, say).
Jython 2 also achieves a time for quartic
roughly the same as its own add_float_float
,
suggesting the residual overhead (probably two type checks) is paid only once
and the in-lined code optimised as well as for the native case.
In VSJ 2, the overhead of the quartic
is basically 5 times
that of add_float_float
,
showing that there is no in-lining of the separate calls
that would bring the floating point calculation together in one place.
This is also approximately true of VSJ 2 with invokedynamic
:
the overhead relative to pure Java is 43ns,
around 3.5 times the 12ns on add_float_float
.
Evidently we do not get the remarkable concurrency seen in quartic
for Jython 2 and pure Java.
Finally in VSJ 3, the specialisation to class allows the handles to in-line fully, and we are down to 7ns total overhead relative to the pure Java expression.