Viper Emitter Array/List Error

Post by **danimede** » Wed Jun 03, 2015 6:30 pm

Hi everybody,

I'm trying to benchmark some simple applications with uPY, my configuration is based on the STM32F411 Nucleo board.
I would like to compare the three code emitters (i.e. bytecode, native and viper), with a simple integer matrix multiplication.
I already tested the code suggested here: https://www.kickstarter.com/projects/21 ... sts/665145, achieving very similar results.

Unfortunately I have two kinds of issues with my MatMul test:
1. If I pass MP_EMIT_OPT_VIPER to the mp_compile function:

Code:

mp_obj_t module_fun = mp_compile(pn, source_name, MP_EMIT_OPT_VIPER, false);

I get the following runtime error:

Code:

 assertion "vtype_fromlist == VTYPE_PYOBJ" failed: file "py/emitnative.c", line 1236, function: emit_native_import_name

and this happens with every script I tested.

2. Then, even using the MP_EMIT_OPT_ASM_THUMB or MP_EMIT_OPT_NONE parameter, when I test the matrix multiplication I get this error:

Code:

 assertion "vtype_index == VTYPE_PYOBJ" failed: file "py/emitnative.c", line 1434, function: emit_native_load_subscr

I've done some additional tests and verified that this issue is related to loading and storing data in arrays/lists.

Just for completeness the Matrix Multiplication I tested is the following:

Code:

import pyb


@micropython.viper
def matmul(first, second, multiply):
    
    row = 10
    col = 10
    sum = 0
    
    for i in range(0, row):
        for j in range(0, col):            
            for k in range(0, row):
                sum = sum + first[i][k]*second[k][j];
               
            multiply[i][j] = sum;
            sum = 0;

first = [0 for x in range(10)]
second = [0 for x in range(10)]
multiply = [[0 for x in range(10)] for x in range(10)] 

first[0] =  [1,2,3,4,5,6,7,8,9,10]
first[1] =  [11,12,13,14,15,16,17,18,19,20]
first[2] =  [21,22,23,24,25,26,27,28,29,30]
first[3] =  [31,32,33,34,35,36,37,38,39,40]
first[4] =  [41,42,43,44,45,46,47,48,49,50]
first[5] =  [51,52,53,54,55,56,57,58,59,60]
first[6] =  [61,62,63,64,65,66,67,68,69,70]
first[7] =  [71,72,73,74,75,76,77,78,79,80]
first[8] =  [81,82,83,84,85,86,87,88,89,90]
first[9] =  [91,92,93,94,95,96,97,98,99,100]

second[0] = [1,2,3,4,5,6,7,8,9,10]
second[1] = [1,2,3,4,5,6,7,8,9,10]
second[2] = [1,2,3,4,5,6,7,8,9,10]
second[3] = [1,2,3,4,5,6,7,8,9,10]
second[4] = [1,2,3,4,5,6,7,8,9,10]
second[5] = [1,2,3,4,5,6,7,8,9,10]
second[6] = [1,2,3,4,5,6,7,8,9,10]
second[7] = [1,2,3,4,5,6,7,8,9,10]
second[8] = [1,2,3,4,5,6,7,8,9,10]
second[9] = [1,2,3,4,5,6,7,8,9,10]

while True:

    matmul(first, second, multiply)
    pyb.delay(1000)

Thus I would like to ask: has anyone tested arrays/lists with the viper emitter?
If someone could give me some tips, it would be much appreciated.
Many thanks in advance.

Best,
D

Post by **Damien** » Thu Jun 04, 2015 11:03 am

Viper features are still in development, so there are lots of issues that you will encounter.

I just fixed a few things so that your code can run correctly.

First, you don't need to set MP_EMIT_OPT_VIPER in the mp_compile function. That's done automatically when you put the "@micropython.viper" decorator at the start of a function. If you use that flag in mp_compile then *all* code is compiled in viper mode, and that will lead to many problems.

Second, I have added the ability to load/store from/to Python lists (and any other object).

Third, the only thing you need to change to make your code run is to cast the result of the multiply to an int:

Code:

sum = sum + int(first[i][k]*second[k][j])

This is because the variable sum is a native viper int, but the result of the subscript of the first/second list is a Python object.

Fourth, at the moment your code uses a mix of native ops/types and Python ops/types. The for loops will be native but the lists are Python objects. To make everything native you need to do something like:

Code:

@micropython.viper
def matmul(first:ptr, second:ptr, multiply:ptr):
    N = 10
    for i in range(0, N):
        for j in range(0, N):
            sum = 0
            for k in range(0, N):
                sum += first[i * N + k] * second[k * N + j]
            multiply[i * N + j] = sum

Then your lists must be flat and square, not nested.
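
To make the flat layout concrete: in a row-major flat list, element (i, j) of an N x N matrix sits at index i * N + j. The following is a minimal plain-Python sketch of that mapping; the helper name `flatten` is illustrative and not part of the thread's code.

```python
def flatten(nested):
    """Return the rows of a nested (list-of-lists) matrix as one flat list."""
    flat = []
    for row in nested:
        flat.extend(row)
    return flat

N = 3
nested = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
flat = flatten(nested)

# Element (i, j) of the nested matrix lives at index i * N + j of the flat list.
for i in range(N):
    for j in range(N):
        assert flat[i * N + j] == nested[i][j]
```

The same index expression is what appears as `first[i * N + k]` and `second[k * N + j]` in the function above.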

Fifth, the above won't work just yet because viper can't do native multiplication.

Check out the test scripts in tests/micropython/viper* for some examples of viper code that does work.

Post by **Damien** » Thu Jun 04, 2015 9:19 pm

OK, viper native multiplication is now implemented. So your first example code should work if you cast each first/second lookup to an int. E.g.:

Code:

sum += int(first[i][k]) * int(second[k][j])

This is doing the multiply natively.
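
As a rough illustration of the two casting styles, here is a small sketch with two dot-product functions (hypothetical names, not from the thread). It assumes a MicroPython build that includes the fixes mentioned above; the `try`/`except` stand-in for the `micropython` module is an addition so the same file also runs under CPython, where the decorator does nothing.

```python
try:
    import micropython          # present on MicroPython
except ImportError:
    class micropython:          # stand-in so the sketch also runs on CPython
        @staticmethod
        def viper(func):
            return func

@micropython.viper
def dot_cast_product(a, b, n: int) -> int:
    total = 0
    for i in range(n):
        # Multiply two Python objects, then cast the product to a native int.
        total += int(a[i] * b[i])
    return total

@micropython.viper
def dot_cast_operands(a, b, n: int) -> int:
    total = 0
    for i in range(n):
        # Cast each operand first, so the multiply itself is native.
        total += int(a[i]) * int(b[i])
    return total

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
print(dot_cast_product(a, b, 4), dot_cast_operands(a, b, 4))
```

Both functions return the same value; the difference is only where the multiplication takes place.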

Post by **danimede** » Fri Jun 05, 2015 12:37 pm

Dear Damien,

first of all thanks a lot for all the support!

I've seen both commits on GitHub: the fix in emitnative.c and the change that adds the native multiply operation to the viper emitter. Many thanks.
Now the integer matrix multiplication also works fine with the viper emitter (I'm posting the entire script in the hope that it helps other users):

Code:

import pyb

@micropython.viper
def matmul(first, second, multiply):
    
    row = 10
    col = 10
    
    for i in range(0, row):
        for j in range(0, row):    
            sum = 0        
            for k in range(0, col):
                sum += int(first[i*col+k])*int(second[k*row+j])
                
            multiply[i*row+j] = sum

row = 10
col = 10
multiply = [0 for x in range(row*col)]

first = [   1,2,3,4,5,6,7,8,9,10,           \
            11,12,13,14,15,16,17,18,19,20,  \
            21,22,23,24,25,26,27,28,29,30,  \
            31,32,33,34,35,36,37,38,39,40,  \
            41,42,43,44,45,46,47,48,49,50,  \
            51,52,53,54,55,56,57,58,59,60,  \
            61,62,63,64,65,66,67,68,69,70,  \
            71,72,73,74,75,76,77,78,79,80,  \
            81,82,83,84,85,86,87,88,89,90,  \
            91,92,93,94,95,96,97,98,99,100]

second = [  1,2,3,4,5,6,7,8,9,10,      \
            1,2,3,4,5,6,7,8,9,10,      \
            1,2,3,4,5,6,7,8,9,10,      \
            1,2,3,4,5,6,7,8,9,10,      \
            1,2,3,4,5,6,7,8,9,10,      \
            1,2,3,4,5,6,7,8,9,10,      \
            1,2,3,4,5,6,7,8,9,10,      \
            1,2,3,4,5,6,7,8,9,10,      \
            1,2,3,4,5,6,7,8,9,10,      \
            1,2,3,4,5,6,7,8,9,10]

while True:
    matmul(first, second, multiply)
    pyb.delay(1000)

At this point I can share with all of you my profiling results of this micro test application:

uPY ByteCode: 3.6M cycles
uPY Native: 2.4M cycles
uPY Viper: 558K cycles
C Native: 10K cycles

As you can see, the expected trend is more or less confirmed. But there is also a huge overhead with respect to the same code executed on the same MCU as a native C implementation (with the same input dataset, obviously).
Of course I expected to see some overhead, but at least ~50 times slower is more than I expected!
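
For reference, the slowdown factors implied by the figures above can be worked out with a few lines of Python. The cycle counts are the rounded values quoted in this post, so the ratios are approximate.

```python
# Rounded cycle counts quoted above for the 10x10 integer matmul.
cycles = {
    "uPY ByteCode": 3600000,
    "uPY Native": 2400000,
    "uPY Viper": 558000,
    "C Native": 10000,
}

baseline = cycles["C Native"]
slowdown = {name: count / baseline for name, count in cycles.items()}

for name, ratio in slowdown.items():
    print("%-14s %6.1fx" % (name, ratio))
```

With these inputs the bytecode, native and viper emitters come out at about 360x, 240x and 56x the C cycle count respectively.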

Is this behavior already known/expected? And does anyone know why? I think that by understanding the reason for such a loss we could improve performance... and share the benefit.

I also started to benchmark another simple application, but this second app deals with floating point (single precision) and I found an even bigger overhead... but for this I think it would be better to open another topic.

Thanks once again,
D

Post by **dhylands** » Fri Jun 05, 2015 3:12 pm

You have to keep in mind that even with viper, it's still including the time to parse and compile the python code.

Post by **danimede** » Fri Jun 05, 2015 6:40 pm

Hi Dave,

dhylands wrote: You have to keep in mind that even with viper, it's still including the time to parse and compile the python code.

In general what you say is right. But my scenario is a bit different, let me explain how I performed the evaluation.

1. I developed a few very simple methods for the dev module, to be used from the uPY script. With them I can enable/disable the HW counters on the STM32F411 Nucleo board and read the number of elapsed cycles. Moreover, while counting is enabled I can disable all maskable interrupts. I developed the following functions:

Code:

    dev.HWCOUNTER.reset_timer()
    dev.HWCOUNTER.start_timer()
    dev.HWCOUNTER.stop_timer()
    dev.HWCOUNTER.getCycles()

2. Thus, I profiled and tested all the new methods. Firstly I measured the overhead introduced by the measurement itself: around 2.7K cycles.
3. Then I tested the performance counter methods profiling the pyb.delay(). I tested a bunch of configurations and the measurements seem to be very precise for all of them.
4. Lastly, I used such methods to evaluate the matmul(), in the following way:

Code:

while True:
    gc.collect()
    dev.HWCOUNTER.reset_timer()  # here we call also: portDISABLE_INTERRUPTS()
    dev.HWCOUNTER.start_timer()
    
    matmul(first, second, multiply)

    dev.HWCOUNTER.stop_timer()  # here we call also: portENABLE_INTERRUPTS()
    cycles = dev.HWCOUNTER.getCycles()
    print('MatMul: uPY Cycles: %d\r' % (cycles))

Thus, I think that in my evaluation the time to parse and compile the Python code is not included.
As a consequence, I'm still confused about the huge overhead measured.
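
The measurement approach described in the steps above (time an empty region to get the overhead, then subtract it from the timed work) can be sketched in portable Python. This is only an illustration: it uses `time.perf_counter_ns` as a stand-in for the custom `dev.HWCOUNTER` methods, which are not reproduced here, and `measure`, `empty` and `work` are hypothetical names.

```python
import time

def measure(func, repeats=5):
    """Return the smallest elapsed time in ns over several calls to func."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        func()
        elapsed = time.perf_counter_ns() - start
        if best is None or elapsed < best:
            best = elapsed
    return best

def empty():
    """Do nothing, so that only the measurement overhead is timed."""
    pass

def work():
    """Placeholder workload standing in for matmul()."""
    total = 0
    for i in range(10000):
        total += i * i
    return total

overhead = measure(empty)          # cost of the measurement itself
raw = measure(work)                # workload plus measurement cost
net = max(raw - overhead, 0)       # workload alone, clamped at zero
print("overhead:", overhead, "ns, net:", net, "ns")
```

Because the function is already compiled by the time it is called, a measurement of this shape covers execution only, not parsing or compilation.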

Of course, your tip was much appreciated.

Thanks,
D

Post by **pfalcon** » Fri Jun 05, 2015 7:20 pm

danimede wrote: C Native: 10K cycles

If you rewrite it in an optimized assembler, it will be even faster. MicroPython allows you to use assembler, so if you're concerned with just performance, that's the easiest way to follow. However, if you want to contribute to further development of viper code emitter, please read on.

danimede wrote: And does anyone know why? I think that by understanding the reason for such a loss we could improve performance... and share the benefit.

Well, everyone knows why - because Python is a high-level dynamic language. So it necessarily has more overhead. And to get more performance, programs should be written in an appropriate way. For example, it makes no sense to talk about numeric processing performance while using Python lists. List is a high-level data structure suitable to store any kind of object. Keeping numbers in list for numeric processing will of course have subideal performance.

Instead, numbers should be kept in arrays. But you may find that viper doesn't natively support arrays beyond bytearray (there's a workaround), and even then it supports only 8- and 16-bit items. Another issue with viper is that there's not enough debugging support, so hacking on it is complicated. If you have already started looking into the commits, that's the right way to approach the problem and understand how it works. Then, if you make any improvements, please share them so other people can continue from there!
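
To illustrate the difference between the two containers: a list holds references to arbitrary objects, while an `array` holds fixed-size machine values packed back to back. A short sketch follows; the variable names are illustrative, and the 2-byte item size assumes the usual 16-bit `'h'` typecode.

```python
from array import array

values = [1, 2, 3, 4]

as_list = list(values)
as_list.append("text")            # a list accepts any kind of object

as_array = array('h', values)     # 'h' = signed 16-bit items
raw = bytes(as_array)             # copy of the packed underlying buffer
assert len(raw) == len(as_array) * 2

try:
    as_array.append("text")       # non-integer items are rejected
    accepted = True
except TypeError:
    accepted = False

print(len(raw), accepted)
```

Because every item in the array has the same known size, code can read the values straight out of the buffer without inspecting each object first.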

Post by **danimede** » Sat Jun 06, 2015 5:49 pm

Dear pfalcon,

pfalcon wrote: If you rewrite it in an optimized assembler, it will be even faster. MicroPython allows you to use assembler, so if you're concerned with just performance, that's the easiest way to follow.

Not exactly: I'm concerned with the best trade-off between usability and performance. Since the Viper emitter is about performance, I'm interested in figuring out the best way to use it (and to improve it). The native C version is just an additional reference; otherwise, how can we get an idea of how good it is? C code is not as high-level as Python but not as low-level as assembler, so I thought it would be a good candidate (of course, any better idea is very welcome).

pfalcon wrote: Well, everyone knows why - because Python is a high-level dynamic language.

Well, I think this is the answer to the question "Why should uPY have some overhead?". My question concerns the huge overhead I found... maybe my impression is wrong (I mean calling it "huge"), and in that case a technical explanation would be great.

pfalcon wrote: List is a high-level data structure suitable to store any kind of object. Keeping numbers in list for numeric processing will of course have subideal performance.

This kind of comment is very useful! I mean, if there is a way to do better I will be very happy to learn how to do it. Any tip in this direction would be very much appreciated!

Thanks for all your time,
D

Post by **Damien** » Wed Jun 10, 2015 11:04 am

danimede wrote: if there is a way to do better I will be very happy to learn how to do it.

Here is how: you use native viper pointers. Add the following code to your script above:

Code:

@micropython.viper
def matmul_fast(first:ptr16, second:ptr16, multiply:ptr16):
    row = const(10)
    col = const(10)
    for i in range(0, row):
        for j in range(0, row):
            sum = 0
            for k in range(0, col):
                sum += first[i*col+k] * second[k*row+j]
            multiply[i*row+j] = sum

from array import array
multiply_fast = array('h', multiply)
first_fast = array('h', first)
second_fast = array('h', second)

t = pyb.micros()
matmul_fast(first_fast, second_fast, multiply_fast)
t = pyb.elapsed_micros(t)
print(t, "us")

The viper code is now fully native. My benchmarking (using pyb.elapsed_micros) gives:

bytecode emitter: 24400us
native emitter: 19800us
viper emitter: 4200us
viper with matmul_fast: 675us

This is no longer a big difference to native C. The difference that is left over is optimisation of the emitted assembly. If you compile your C code with -O0 then you should get similar numbers to viper with matmul_fast.

Note that only ptr8 and ptr16 are supported at the moment. Feel free to add support for ptr (word sized pointers)!
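
Since the arrays above use typecode `'h'` (signed 16-bit items), it may be worth checking that every input and every result fits in that range before converting. A plain-Python sketch of such a check, rebuilding the same input data as the script earlier in the thread; `matmul_ref` is an illustrative helper, not part of the original code.

```python
from array import array

N = 10
first = [i + 1 for i in range(N * N)]          # 1..100, row-major
second = [(i % N) + 1 for i in range(N * N)]   # every row is 1..10

def matmul_ref(a, b, n):
    """Reference product of two flat row-major n x n matrices."""
    out = [0] * (n * n)
    for i in range(n):
        for j in range(n):
            total = 0
            for k in range(n):
                total += a[i * n + k] * b[k * n + j]
            out[i * n + j] = total
    return out

result = matmul_ref(first, second, N)

# Range of a signed 16-bit item.
INT16_MIN, INT16_MAX = -32768, 32767
assert all(INT16_MIN <= v <= INT16_MAX for v in first + second + result)

result_fast = array('h', result)   # safe: all values were checked above
print(max(result))
```

For this dataset the largest result is 9550, comfortably inside the 16-bit range; larger matrices or larger values would need wider items.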

MicroPython Forum (Archive)
