ulab, or what you will - numpy on bare metal

C programming, build, interpreter/VM.
Target audience: MicroPython Developers.
chuckbook
Posts: 135
Joined: Fri Oct 30, 2015 11:55 pm

Re: ulab, or what you will - numpy on bare metal

Post by chuckbook » Thu Oct 03, 2019 1:35 pm

v923z wrote:
I think there is a misunderstanding stemming from this post:
chuckbook wrote: ↑
Fri Sep 27, 2019 1:13 pm
Very impressive! Thanks for sharing this. 1k FFT (SP) in ~0.8ms on PYBD, not bad.
In the original post I quoted a measurement of 1.948 ms and claimed that the FFT could be computed in less than 2 ms, not in 0.8 ms. So it is only a factor of two in speed, and there is essentially no overhead in RAM, because, with the exception of a handful of temporary variables, the transform happens in place. It is also true that I did not overclock the CPU, so, to be fair, the gain is a bit larger: extrapolating your numbers to a CPU clocked at 168 MHz, the FFT in assembly would cost around 4.3 ms.
Just to confirm, I verified the 0.8 ms execution time on a PYBD767 running at the default clock speed of 216 MHz. What was the hardware that gave the 2 ms execution time?
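
For reference, a minimal sketch of how such a timing might be taken on the board, assuming a ulab build that exposes an FFT entry point as ulab.fft() (the exact name and signature may differ between ulab versions):

Code: Select all

# 1k FFT timing sketch for a pyboard; ulab.fft() is an assumed entry point,
# check the function name in your ulab build
import utime
import ulab

x = ulab.ndarray([i % 16 for i in range(1024)])  # 1024-sample test signal

t0 = utime.ticks_us()
y = ulab.fft(x)                                  # assumed FFT call
dt = utime.ticks_diff(utime.ticks_us(), t0)      # elapsed microseconds
print('1k FFT:', dt / 1000, 'ms')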

v923z
Posts: 168
Joined: Mon Dec 28, 2015 6:19 pm

Re: ulab, or what you will - numpy on bare metal

Post by v923z » Thu Oct 03, 2019 2:00 pm

chuckbook wrote:
Thu Oct 03, 2019 1:35 pm
v923z wrote:
I think there is a misunderstanding stemming from this post:
chuckbook wrote: ↑
Fri Sep 27, 2019 1:13 pm
Very impressive! Thanks for sharing this. 1k FFT (SP) in ~0.8ms on PYBD, not bad.
In the original post I quoted a measurement of 1.948 ms and claimed that the FFT could be computed in less than 2 ms, not in 0.8 ms. So it is only a factor of two in speed, and there is essentially no overhead in RAM, because, with the exception of a handful of temporary variables, the transform happens in place. It is also true that I did not overclock the CPU, so, to be fair, the gain is a bit larger: extrapolating your numbers to a CPU clocked at 168 MHz, the FFT in assembly would cost around 4.3 ms.
Just to confirm, I verified the 0.8 ms execution time on a PYBD767 running at the default clock speed of 216 MHz. What was the hardware that gave the 2 ms execution time?
Thanks for the report! I measured on a pyboard v1.1, the gold standard. The PYBD767 has a different processor, which would explain the difference.

chuckbook
Posts: 135
Joined: Fri Oct 30, 2015 11:55 pm

Re: ulab, or what you will - numpy on bare metal

Post by chuckbook » Thu Oct 03, 2019 3:11 pm

@v923z: Thanks for the info.
BTW, using our build settings (gcc version 8.2.0, -O2) the test gave 1.8 ms on a PYBV11.
-O2 results in a bigger code size, but it makes sense to use it if there is some spare flash available.

v923z
Posts: 168
Joined: Mon Dec 28, 2015 6:19 pm

Re: ulab, or what you will - numpy on bare metal

Post by v923z » Thu Oct 03, 2019 3:34 pm

chuckbook wrote:
Thu Oct 03, 2019 3:11 pm
@v923z: Thanks for the info.
BTW, using our build settings (gcc version 8.2.0, -O2) the test gave 1.8 ms on a PYBV11.
-O2 results in a bigger code size, but it makes sense to use it if there is some spare flash available.
Good to know. I used the standard settings; beyond passing the USER_C_MODULES parameter to make, I haven't modified anything in the makefile. My gcc version is 7.4.0; I don't know whether that changes anything.

As for the code size, did you actually measure that, or are you relying on gcc's claim? In the past I compiled a lot for the ATmega, and the claim was the same: -O2 should produce faster but slightly bigger firmware. My experience was that the firmware almost always came out slightly smaller, and I also gained in speed. Hence my question.

chuckbook
Posts: 135
Joined: Fri Oct 30, 2015 11:55 pm

Re: ulab, or what you will - numpy on bare metal

Post by chuckbook » Thu Oct 03, 2019 5:03 pm

Here are the code sizes for the -O2 and -Os build options.

Code: Select all

   text    data     bss     dec     hex filename
 463704      40   28052  491796   78114 build-PYBV11_O2/firmware.elf
 424484      40   28052  452576   6e7e0 build-PYBV11/firmware.elf

v923z
Posts: 168
Joined: Mon Dec 28, 2015 6:19 pm

Re: ulab, or what you will - numpy on bare metal

Post by v923z » Thu Oct 03, 2019 5:12 pm

Chuck,
chuckbook wrote:
Thu Oct 03, 2019 5:03 pm
Here are the code sizes for the -O2 and -Os build options.

Code: Select all

   text    data     bss     dec     hex filename
 463704      40   28052  491796   78114 build-PYBV11_O2/firmware.elf
 424484      40   28052  452576   6e7e0 build-PYBV11/firmware.elf
You are probably compiling other modules into the firmware, because my size with the -O2 switch is

Code: Select all

   text	   data	    bss	    dec	    hex	filename
 347832	     40	  27888	 375760	  5bbd0	firmware.elf
Without ulab, the size is about 16 kB smaller. I can't believe that the compiler version alone (gcc 8.2 vs. 7.4) would explain such a significant difference.

chuckbook
Posts: 135
Joined: Fri Oct 30, 2015 11:55 pm

Re: ulab, or what you will - numpy on bare metal

Post by chuckbook » Fri Oct 04, 2019 9:18 am

Don't be misled by the absolute size of the code: there are a lot of additional features included in our build. I just wanted to demonstrate the code size increase from -Os to -O2.

v923z
Posts: 168
Joined: Mon Dec 28, 2015 6:19 pm

Re: ulab, or what you will - numpy on bare metal

Post by v923z » Fri Oct 04, 2019 10:05 am

chuckbook wrote:
Fri Oct 04, 2019 9:18 am
Don't be misled by the absolute size of the code: there are a lot of additional features included in our build. I just wanted to demonstrate the code size increase from -Os to -O2.
Thanks!

v923z
Posts: 168
Joined: Mon Dec 28, 2015 6:19 pm

commutative operations

Post by v923z » Fri Oct 04, 2019 10:26 am

Hi all,

I have tried to clean up the code for binary operations and have run into a fundamental problem with commutative operators. Namely, this case can be handled:

Code: Select all

import ulab

a = ulab.ndarray([1, 2, 3])
a*5
because evaluation of the product operator begins with a, which is an ndarray, so my code handles it. However, if I turn the operands around, like

Code: Select all

import ulab

a = ulab.ndarray([1, 2, 3])
5*a
then I end up with a TypeError:

Code: Select all

Traceback (most recent call last):
  File "/dev/shm/micropython.py", line 7, in <module>
TypeError: unsupported types for __mul__: 'int', 'ndarray'
The reason is that evaluation now begins with 5, and an int does not know how to multiply itself by an object that is not a scalar. A similar problem exists for lists, but there it is solved in the mp_obj_int_binary_op_extra_cases function of objint.c (https://github.com/micropython/micropyt ... int.c#L370), where the operands are simply swapped, so that the evaluation order comes out right.
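
As a Python-level illustration of the same mechanism (not the actual C fix): the operand swap in objint.c plays the role that a reflected method such as __rmul__ plays for a regular Python class. A hypothetical Vec class sketches the idea:

Code: Select all

# pure-Python analogue of the reflected-operand mechanism; Vec is a
# hypothetical stand-in for ndarray, not ulab code
class Vec:
    def __init__(self, data):
        self.data = list(data)

    def __mul__(self, scalar):      # handles Vec * 5
        return Vec([x * scalar for x in self.data])

    __rmul__ = __mul__              # handles 5 * Vec by reusing __mul__

print((Vec([1, 2, 3]) * 5).data)    # [5, 10, 15]
print((5 * Vec([1, 2, 3])).data)    # [5, 10, 15]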

Now, runtime.c contains the generic_binary_op label (https://github.com/micropython/micropyt ... ime.c#L563), where one could, in principle, hook into mp_binary_op, but it is used only in a couple of specific cases and is not generic in this sense.

My question is: shouldn't there be a case at the end of the switch in mp_binary_op that simply jumps to generic_binary_op when everything before it has failed? Or is there another mechanism for overriding the standard binary operator from the user module itself (i.e., without having to modify the MicroPython code base)?

In numpy, an ndarray can be multiplied by a scalar or by another ndarray, irrespective of the order of the operands. I think it would be great if we could support that.
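
For reference, this is the behaviour on desktop numpy that I would like to match:

Code: Select all

# desktop numpy: multiplication works with either operand order
import numpy as np

a = np.array([1, 2, 3])
print(a * 5)    # [ 5 10 15]
print(5 * a)    # [ 5 10 15]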

Thanks,
Zoltán

pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK

Re: ulab, or what you will - numpy on bare metal

Post by pythoncoder » Sat Oct 05, 2019 11:11 am

Would it be possible to trap the exception and, if it occurs, use the integer to instantiate an object for which you define __mul__ etc.?
Peter Hinch
Index to my micropython libraries.
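
A rough Python-level sketch of this suggestion (ScalarWrapper and safe_mul are hypothetical names; the snippet illustrates the wrap-and-retry pattern rather than making the bare 5 * a expression work):

Code: Select all

# hypothetical sketch of the "wrap the integer" idea; ulab.ndarray is taken
# from the examples above, everything else is illustrative only
import ulab

class ScalarWrapper:
    def __init__(self, value):
        self.value = value

    def __mul__(self, other):       # ScalarWrapper * ndarray
        return other * self.value   # reuse ndarray * scalar, which works

def safe_mul(lhs, rhs):
    try:
        return lhs * rhs
    except TypeError:
        # int * ndarray failed: wrap the scalar and retry with the
        # operand order that ndarray can handle
        return ScalarWrapper(lhs) * rhs

a = ulab.ndarray([1, 2, 3])
print(safe_mul(5, a))               # same result as a * 5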
