Wednesday, February 29, 2012

Simple Performance Metrics

I've been asked many times recently "How does the performance of clojure-py compare to clojure on the JVM?". I think the best way to go about this is to define what it is that the pypy JIT excels at, this will help us better understand our performance metrics.

The PyPy jit is a tracing jit. I'm not going to go into all the ins and outs of tracing jits, but suffice it to say, it's a totally different way of writing a jit than what is used in Hotspot. What PyPy excels at is optimizing dynamic code, or put in Clojure terms, anytime you think "oops...reflection might be happening here" this is where PyPy can step in and save the day. To start with, let's look at this Clojure (jvm) code:

(ns examples.speedtest)

(definterface IStrValue
    (strvalue []))

(deftype A [x]
    (strvalue [self] (str x)))

(deftype B [x]
    (strvalue [self] (str (+ x 1))))

(deftype C [x]
    (strvalue [self] (str (dec x))))

(defn getstr [obj]
    (.strvalue obj))

(time (dotimes [x 10000000]
    (getstr (A. x))
    (getstr (B. x))
    (getstr (C. x))))

Execution time on my machine is around 36 sec. If we set "warn on reflection" on, the compiler will complain that (.strvalue) is causing reflection. Now, the proper way to handle this is to give a type hint to getstr:

(defn getstr [^IStrValue obj]
    (.strvalue obj))

Now the execution time is only 3 sec! 10x speed boost by a single tag. This isn't bad.

But let's see how clojure-py fares:

(ns examples.speedtest)

(deftype A [x]
    (strvalue [self] (str x)))

(deftype B [x]
    (strvalue [self] (str (+ x 1))))

(deftype C [x]
    (strvalue [self] (str (dec x))))

(defn getstr [obj]
    (.strvalue obj))

(time (dotimes [x 10000000]
    (getstr (A. x))
    (getstr (B. x))
    (getstr (C. x))))

Execution time: 8 sec

To me, the boon here is that the clojure-py code is more dynamic. We don't have to worry about interfaces, inheritance,  or type inference. We simply write our code in an idiomatic way, and let the jit figure the rest out. 

But how does clojure-py compare when it comes to raw number performance? Well there it's not quite so good a story. In most of my tests clojure-py is about 2-8x slower than the JVM version, but almost exactly the same speed as Python code.

So there's the performance numbers. If you're willing to write interfaces, provide type hints, etc. Then the JVM will be faster. But if all you want to do is pound out some code with acceptable performance, then maybe clojure-py is the way to go. And's not Java!

(sorry, couldn't resist)

Saturday, February 25, 2012

Vars and Calling Methods

Vars in clojure are a simple way to name some information. We create vars with the def special form:

(def answer 42)

Now, a clojure beginner may think that we've simply assigned the name "answer" to the number 42. However, this is not exactly true. Instead, we've create a var or, more simply a "box", into which we've put the number 42, and given this box the name "answer".

Initially, translating this to Python seems quite straight forward: we simply vars into the globals dictionary in a python module, then we can refer to them by name. And, this is exactly the way we go about it in our current compiler.

However, this presents a problem when we start talking about two other ideas behind vars: consistent referencing, and binding:

Consistent referencing:
There is a small bug in our current clojure-py implementation. Let's say we define answer in a module called "a" and then import this module into a different module called "b". The problem that we've created, is that we now have a inconsistent way of referencing this value. If module "a" ever redefines the answer variable, b will never see these changes. We've copied the value, instead of copying the reference to this value.

In clojure-jvm, this bug is not shown, since all variables are stored into "Var" objects. These objects are copied and exported to other namespaces, then the clojure compile automatically dereferences these vars whenever the var's data is accessed. So this code:

(def answer-plus-one (inc answer))

Is really compiled by the compiler into something much more like this:

(def answer-plus-one ((.deref inc) (.deref answer)))

We could do this in python, and this might be a possible answer...but let's look at the next problem:

Clojure allows us, on a per-thread basis,  to re-assign the value of certain vars;

(def p (binding [answer 0] answer))
;; p = 0

The way this is acceived in clojure is by defining the var's deref function as something similar to this:

(.deref [this]
    (if (bindings-exist)
         (let [v (get bindings this-var-name)]
                (if v v this-var-value))

Ouch! so when we have a binding, we're forced to do a dictionary look-up? That sounds painfully slow. But perhaps there's ways to optimize this. Firstly, we must realize that vars don't often change. Most vars are "static" and should never be mutated, in fact, it may be possible to restrict them to never change....ever, unless defined to be "dynamic" vars. Secondly, we don't have to implement bindings as dictionaries, instead we could implement them as dynamic objects and use getattr/setattr on them. In PyPy using objects instead of dicts for situations like this could be much, much faster. about some performance metrics:

In this example, we'll have some code count from 1 to 40 million, and to this 100 times. We'll be looking at different ways of supplying the + function to the loop and see what results in the fastest times:

(def highend 40000000)
(def interations 10)

(defn our-inc [x]
    (py.bytecode/BINARY_ADD x 1))

Method 1 - Global defs
Here we take the Python way and simply pass in the variables as if this was pure Python code. Since this is the way the clojure-py compiler currently acts, this will be a good baseline.

(defn load-global-test [highend]
    (loop [x 0]
        (if (< x highend)
            (recur (our-inc x)))))

(print "load-global-test")
(time (dotimes [x interations] (load-global-test highend)))

Result: 6053.08794975 msecs

Method 2 - Local defs
What if we passed these values in as arguments, then referred to them as local variables, instead of global variables?

(defn with-local-test [f highend]
    (loop [x 0]
        (if (< x highend)
            (recur (f x)))))

(print "with-local-test")
(time (dotimes [x interations] (with-local-test our-inc highend)))

Result: 6053.04908752 msecs

Well that did all of nothing. Let's try something else.

Method 3 - Vars without bindings
In this case, let's box up the function into a var and call it with .deref the way clojure does:

(deftype V [itm]
    (deref [self] itm))

(def VInc (V our-inc))

(defn with-var-test [v highend]
    (loop [x 0]
        (if (< x highend)
            (recur ((.deref v) x)))))

(print "with-var-test")
(time (dotimes [x interations] (with-var-test VInc highend)))

Result:  8084.71488953 msecs

Method 4 - Vars with bindings
So if vars aren't much slower, what sort of hit to we take by using bindings? This will basically be putting the variable into a object that is put into a var. A bit ugly, but we need it if we are going to support the bindings form:

(deftype V2 [itm bindings]
    (deref [self] (.deref bindings)))

(def VInc2 (V2 nil (V our-inc)))

(defn with-binding-test [v highend]
    (loop [x 0]
        (if (< x highend)
            (recur ((.deref v) x)))))

(print "with-binding-test")

(time (dotimes [x interations] (with-binding-test VInc2 highend)))

Result:  14154.5939445 msecs

Using bindings is actually about 2x slower than not using bindings. We may want to be a bit careful how we go about defining dynamic vars then.

Method 5 - Methods as constants

If we say that non-dynamic vars will never change, why not just load their contents as constants via the bytecode "LOAD_CONST"?

(defn with-const-test [highend]
    (loop [x 0]
        (if (< x highend)
            (recur ((py.bytecode/LOAD_CONST (fn [i] (py.bytecode/BINARY_ADD i 1))) x)))))

(print "with-const-test")

(time (dotimes [x interations] (with-const-test highend)))

Result: 6053.23505402 msecs

Not much faster at all

Method 6 - Inlined functions
Finally, let's try to simply inline the call as a direct call to the bytecode BINARY_ADD

(defn with-inline-test [highend]
    (loop [x 0]
        (if (< x highend)
            (recur (py.bytecode/BINARY_ADD x 1)))))

(print "with-inline-test")

(time (dotimes [x interations] (with-inline-test highend)))

Result: 6048.96688461 msecs

Once again, practically no difference.

So why are some of the results so similar. Well first of all, I performed these benchmarks on PyPy. PyPy implements a tracing JIT, and by design tracing JITs tend to unroll loops and inline code. So, if the JIT automatically inlines all function calls, there's not much use inlining them ourselves.

What is the takeaway from this then? Quite simply, static vars should be handled as global variable, and dynamic vars should be boxed. The compiler rules will then be defined thusly:

1) any vars not marked as {:dynamic true} are considered static. They cannot be redefined (the compiler will try to enforce this), and cannot have bindings applied to them.
2) any vars marked as {:dynamic true} are considered dynamic. They can be redefined, and in general, will be about 30% slower than static vars. Once bindings are applied they will be over 100% slower than static vars

When compiling code the compiler will attempt to resolve symbols in this order:

1) the compiler will check to see if the symbol is a bytecode
2) the compiler will look for a global, if the global is found, and is not a var, a standard LOAD_GLOBAL will be used. If it is a var, the var will be loaded with LOAD_GLOBAL, then a LOAD_ATTR "deref" will be called on the var.
3) if no global is found, and the symbol is not a special form, an error will be thrown.