Benchmarks

Turbinenreiter
Posts: 288
Joined: Sun May 04, 2014 8:54 am

Benchmarks

Post by Turbinenreiter » Wed Aug 13, 2014 8:31 pm

Ok, here is the function I used to test:

Code: Select all

def test():
    l = 10000000
    s = 0
    for i in range(l):
        s = s + 1
    return s
And these are the results:
pyboard:
  • standard: 33.061 s
  • native: 15.962 s
  • viper: 1.906 s
unix:
  • micropython standard: 0.48 s | 1088 Kbytes
  • micropython viper: fails
  • pypy2: 0.06 s | 18391 Kbytes
  • cpython3: 0.88 s | 5764 Kbytes
So:
  • on the pyboard, viper is 17 times faster than standard
  • viper on the pyboard is only 4 times slower than standard micropython on an x86 (from 2009 - 2x2.5GHz)
  • standard micropython on x86 is already faster than cpython3
  • pypy is incredibly fast, and the difference grows the longer the function runs. But it also uses a lot more memory than the others
  • micropython is tiny. It needs 17 times less memory than pypy and 5 times less than cpython
If you have suggestions for more functions to test, please post.

blmorris
Posts: 348
Joined: Fri May 02, 2014 3:43 pm
Location: Massachusetts, USA

Re: Benchmarks

Post by blmorris » Wed Aug 13, 2014 11:03 pm

I just tried your integer code on my computer, running the unix port.
I also tried the same thing with floating point:

Code: Select all

def test_f():
    l = 10000000
    s = 0.0
    for i in range(l):
        s = s + 1.0
    return s
This was my timing function:

Code: Select all

import time
def runtime(func):
     a = time.time()
     b = func()
     print(time.time() - a)
     return b

My results on the unix uPy port:

Code: Select all

>>> runtime(test)
0.8763270378112793
10000000
>>> runtime(test_f)
92.259383201599121
10000000.0
Yes, my unix uPy took 92 seconds to compute the floating point version.
On CPython3.3:

Code: Select all

>>> runtime(test)
1.4329819679260254
10000000
>>> runtime(test_f)
1.1191208362579346
10000000.0
Over several runs, the floating point version on CPython was consistently a bit faster than the integer one.

On the stmhal port (not a pyboard, but my own compatible hardware, which should run the same) I got the same result as you for the integer program.
For the float program, I needed to change it so that l=100000 (10^5 instead of 10^7) and it ran in 22.616 seconds.

Any insight as to why uPy is struggling with the floating point calculations? I am running under OSX, and I assumed that both of my versions were compiled with floating point support.

blmorris
Posts: 348
Joined: Fri May 02, 2014 3:43 pm
Location: Massachusetts, USA

Re: Benchmarks

Post by blmorris » Thu Aug 14, 2014 2:49 am

I also ran the following set of commands on stmhal, cut and pasted from a REPL session and edited a bit for clarity:

Code: Select all

>>> def runtime(func, len):               
...     a = pyb.millis()                  
...     b = func(len)                     
...     print((pyb.millis() - a)/1000)    
...     return b
... 
>>> def test_f(len):
...     s = 0.0
...     for i in range(len):
...         s = s + 1
...     return s
... 
>>> runtime(test_f, 100000)  
11.749
100000.0
>>> def test_f(len):         
...     s = 0.0             
...     for i in range(len):
...         s = s + 1.0     
...     return s            
... 
>>> runtime(test_f, 100000) 
22.544
100000.0
>>> def test_i(len):        
...     s = 0               
...     for i in range(len):
...         s = s + 1       
...     return s            
... 
>>> runtime(test_i, 100000) 
0.319
100000
I thought this was interesting because changing the line from "s = s + 1.0" to "s = s + 1" cut the execution time in half. Note that s was already a float. The pure integer version still executes very quickly.

stijn
Posts: 735
Joined: Thu Apr 24, 2014 9:13 am

Re: Benchmarks

Post by stijn » Sun Aug 17, 2014 3:47 pm

My guess would be that uPy treats integers as a special type, while floats are normal objects. So for the float version, every loop iteration has to allocate at least one object, and that isn't exactly cheap - or rather, it's way more expensive than just summing the numbers.

dhylands
Posts: 3821
Joined: Mon Jan 06, 2014 6:08 pm
Location: Peachland, BC, Canada

Re: Benchmarks

Post by dhylands » Sun Aug 17, 2014 4:32 pm

I also noticed that this:

Code: Select all

def test_f3(len):
    s = 0.0
    one = 1.0
    for i in range(len):
        s = s + one
    return s
runs approx twice as fast as this:

Code: Select all

def test_f2(len):
    s = 0.0
    for i in range(len):
        s = s + 1.0
    return s

stijn
Posts: 735
Joined: Thu Apr 24, 2014 9:13 am

Re: Benchmarks

Post by stijn » Sun Aug 17, 2014 6:28 pm

Seems likely then that the 1.0 means having to allocate a new float object for every iteration?

blmorris
Posts: 348
Joined: Fri May 02, 2014 3:43 pm
Location: Massachusetts, USA

Re: Benchmarks - Floating point issues

Post by blmorris » Mon Aug 18, 2014 11:53 pm

I agree that the bottleneck seems to be in allocating float objects rather than the floating point computations themselves, and I devised some more tests to try to demonstrate that.
(I thought about creating a new topic for 'Floating Point issues' but there were already responses here by the time I got back to this.)

Just so the point doesn't get lost at the end of a long and detailed post- I'm curious if anyone else is concerned about being able to write efficient FP routines in uPy and thinks this is worthy of raising an issue on GitHub.

The following functions can demonstrate what @dhylands and I found:

Code: Select all

import pyb
def runtime(func, len):
    a = pyb.millis()  # Starting time
    b = func(len)               
    print('Test ran for', (pyb.millis() - a)/1000, 'seconds')
    return b

def test_f1(len):
    s = 0.0
    x = 1.0
    for i in range(len):
        s = s + x
    return s

def test_f2(len):
    s = 0.0
    for i in range(len):
        s = s + 1.0
    return s

def test_f3(len):
    s = 0.0
    x = 1.0
    for i in range(len):  
        s = s + (x * x)            
    return s
With the following results:

Code: Select all

>>> runtime(test_f1, 100000)
Test ran for 11.668 seconds
100000.0
>>> runtime(test_f2, 100000)
Test ran for 22.411 seconds
100000.0
>>> runtime(test_f3, 100000)
Test ran for 22.435 seconds
100000.0
The tests show that executing "s = s + 1.0" takes the same time as executing "s = s + (x * x)", probably because a float object is created each time we execute "1.0" and when we execute "x * x". Executing "s = s + (1.0 * x)" takes 32.894 seconds, 3 times as long as "s = s + x" because there are two intermediate float objects made, etc.
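One way to sanity-check this allocation counting without instrumenting the VM is a toy wrapper class that tallies object creations (illustrative only - this is my own construct, not how uPy actually boxes floats, but the counts line up with the timing ratios above):

```python
class Tally:
    """Wraps a float and counts how many Tally objects get created."""
    count = 0

    def __init__(self, v):
        Tally.count += 1   # every new object (literal, product, sum) is counted
        self.v = v

    def __add__(self, other):
        return Tally(self.v + other.v)

    def __mul__(self, other):
        return Tally(self.v * other.v)

s, x = Tally(0.0), Tally(1.0)

Tally.count = 0
s + x                 # "s = s + x": one new object per iteration
print(Tally.count)    # -> 1

Tally.count = 0
s + Tally(1.0)        # "s = s + 1.0": boxed literal plus sum, two objects
print(Tally.count)    # -> 2

Tally.count = 0
s + (x * x)           # "s = s + (x * x)": product plus sum, two objects
print(Tally.count)    # -> 2

Tally.count = 0
s + (Tally(1.0) * x)  # "s = s + (1.0 * x)": boxed literal, product, sum
print(Tally.count)    # -> 3
```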

To really push the floating point processing on the chip we need a function that does its FP calculations at the C level rather than the Python level:

Code: Select all

import math
from math import cos
def test_cos1(len):
    s = 0.0
    for i in range(len):
        s = s + cos(i)
    return s

def test_cos2(len):
    s = 0.0
    x = math.pi/3
    for i in range(len):
        s = s + cos(i*x)
    return s
Results:

Code: Select all

>>> runtime(test_cos1,100000)
Test ran for 24.941 seconds
1.03239
>>> runtime(test_cos2,100000)
Test ran for 36.563 seconds
0.0476162
The way I see it, "test_cos1" (add+cos) is analogous to "test_f1" (add) with the FP-intensive cosine operation contributing about 13 seconds to the execution time, and "test_cos2" (add+mult+cos) is analogous to "test_f3" (add+mult), with the cosine again adding about 13 seconds of execution time. (Side note: "test_cos2" also reveals rounding errors in the single-precision cosine computation; each iteration of "cos(i*x)" should yield 1.0, 0.5, -0.5, or -1.0; but larger values of i lead to noticeable rounding errors.)
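The expected cycle can be cross-checked in double precision (plain CPython math here, which is why the rounding problem doesn't show up in this sketch):

```python
import math

x = math.pi / 3
vals = [round(math.cos(i * x), 6) for i in range(6)]
print(vals)  # -> [1.0, 0.5, -0.5, -1.0, -0.5, 0.5]
# The cycle repeats every 6 steps and sums to zero, so over a long run the
# accumulated sum should stay near 0.0; the 0.0476 above is drift from the
# single-precision cos() and the growing argument i*x.
```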

Note: edited so that code examples can be cut-and-pasted directly into a fresh REPL.

blmorris
Posts: 348
Joined: Fri May 02, 2014 3:43 pm
Location: Massachusetts, USA

Re: Benchmarks

Post by blmorris » Tue Aug 19, 2014 3:15 am

One more point, probably relevant to all algorithms that allocate large numbers of objects:

I started looking at the memory usage and found that one of my test cases was using up over 30 bytes per iteration (I don't really know if this is especially large.)
Obviously longer runs were exhausting memory and requiring the garbage collector to be called several times to recover memory. I estimated memory usage per iteration by doing shorter test runs (~1000 iterations or so) and running gc.mem_free() before and after each run. What I found interesting was that the runs which took the longest were not necessarily the ones where gc was called, but rather the runs which came closest to exhausting the available memory. Sometimes the next run would quickly exhaust remaining memory, trigger a call to gc.collect(), and then apparently run much faster when a larger pool of memory was available. Of course, the runs that took the most time would be the ones that completely exhausted memory only when nearly complete, and then trigger a call to gc.collect() to finish.

The shorter version- allocating memory for objects takes time; it seems to take a lot more time (maybe >3x more time?) when there is less memory to allocate.

stijn
Posts: 735
Joined: Thu Apr 24, 2014 9:13 am

Re: Benchmarks - Floating point issues

Post by stijn » Tue Aug 19, 2014 8:15 am

blmorris wrote:I'm curious if anyone else is concerned about being able to write efficient FP routines in uPy and thinks this is worthy of raising an issue on GitHub
Well, it surely would be interesting to see if the developer(s) have an idea of how this could be solved (and if the difference with CPython is solely because the latter uses refcounting instead of gc, or if they also use some small-object trick).
On the other hand: there is already @micropython.viper etc. which gives way better performance. Furthermore, the only way to get as efficient as possible is probably writing your routine in assembly, C or C++.
That last option is what I'm doing anyway: uPy is just the 'control' layer over an application that is mainly written in C++, especially all FP loops, as it's nearly impossible to beat its raw speed.

blmorris
Posts: 348
Joined: Fri May 02, 2014 3:43 pm
Location: Massachusetts, USA

Re: Benchmarks - Floating point issues

Post by blmorris » Tue Aug 19, 2014 3:57 pm

stijn wrote:On the other hand: there is already @micropython.viper etc. which gives way better performance. Furthermore, the only way to get as efficient as possible is probably writing your routine in assembly, C or C++.
That last option is what I'm doing anyway: uPy is just the 'control' layer over an application that is mainly written in C++, especially all FP loops, as it's nearly impossible to beat its raw speed.
That makes sense - if you absolutely need to maximize performance, uPy provides several mechanisms to pursue it, while straight uPy is fine when you want simplicity and absolute performance isn't a huge issue. MicroPython has been a godsend for me in that respect; as a hardware designer I have been dependent on a colleague to write code for work projects, and basically without any means to implement my ideas for personal projects. Of course a high-level language like Python comes with performance tradeoffs; it has been an interesting learning exercise to figure out exactly what those are. In that vein, I have started a project to quantify the cost of allocating memory for objects in stmhal; at this early stage I think it may suggest strategies that others could use to optimize their code. I hope to organize some results for posting in a few days.
-Bryan
