ulab, or what you will - numpy on bare metal

C programming, build, interpreter/VM.
Target audience: MicroPython Developers.
stijn
Posts: 735
Joined: Thu Apr 24, 2014 9:13 am

Re: ulab, or what you will - numpy on bare metal

Post by stijn » Thu Sep 26, 2019 1:57 pm

Here is an ugly example: https://github.com/v923z/micropython-ul ... ray.c#L402
Just some thoughts: lines 407-415 could be factored out into one function like bool get_float_value(uint8_t *typecode, float *value) and then used wherever needed. The other places, where the same loop gets repeated for each typecode and operation, are where C++ templates would really shine. Personally, I'd first investigate how to do that in C++, because I know it would shorten the code tremendously, but in C I'm not sure and YMMV. Probably a macro like this (pseudocode):

Code: Select all

#define DO_LOOP(TYPE, OPERATION) \
    TYPE *outdata = (TYPE *)out->data->items; \
    for(size_t i=0; i < ol->data->len; i++) { \
        float value = ndarray_get_float_value(or->data->items, or->data->typecode, i); \
        outdata[i] = ndarray_get_float_value(ol->data->items, ol->data->typecode, i) OPERATION value; \
    }
....
    if(op == MP_BINARY_OP_ADD) { // Note: branching on the op out here, outside of the loop, should be faster than doing it inside!
        if(typecode == NDARRAY_UINT8) {
            DO_LOOP(uint8_t, +)
        }
        ... and so on
    }
    ... and so on

(also, the only difference between the case where you have a single float or int and the case where you have an ndarray, and so have to call ndarray_get_float_value for each iteration, could be merged by passing that part in as a macro argument as well; not shown here)

but then you still need to branch manually on the typecode-to-type mapping and on the op-to-OPERATION mapping, and I have no idea how to do that at runtime in C. For the operation you could have a lookup table of function pointers, but I don't know about the type <-> typecode mapping. Something with unions, perhaps.

The upcasting function could be replaced with a lookup table in which the key would be a uint16_t formed as (type_left << 8 | type_right), or something along those lines. I can probably come up with more ideas like these, but sorry, I just don't have the time.

v923z
Posts: 168
Joined: Mon Dec 28, 2015 6:19 pm

Re: ulab, or what you will - numpy on bare metal

Post by v923z » Thu Sep 26, 2019 2:16 pm

stijn wrote:
Thu Sep 26, 2019 1:57 pm
The other places, where the same loop gets repeated for each typecode and operation, are where C++ templates would really shine. Personally, I'd first investigate how to do that in C++, because I know it would shorten the code tremendously, but in C I'm not sure and YMMV. Probably a macro like ...
Thanks, I think this is the way to go. You are absolutely right, the problem is that the same for loop is repeated, but for different types. I should have thought of macros earlier. In the heat of the moment, somehow I missed this option :oops:

Cheers,

Zoltán

v923z
Posts: 168
Joined: Mon Dec 28, 2015 6:19 pm

Re: ulab, or what you will - numpy on bare metal

Post by v923z » Thu Sep 26, 2019 5:39 pm

I think stijn's suggestion with the macros is going to work. Especially useful was the observation that the for loop should be the innermost statement, with the branching on the operation outside it; I believe that alone can save a lot of time. Moreover, with the macro one can get rid of the ndarray_get_float_value() function call, which shouldn't hurt, either. I have a working implementation, but I need a bit of time to clean it up.

But before jumping off the deep end, here is a question:
Damien wrote:
Thu Sep 26, 2019 4:22 am
People who already use numpy already know about the subtleties like upcasting and so it'd make sense to just copy the numpy behaviour so people don't need to re-learn.
I have just tried it out; numpy has the following upcasting (coercion) rules:

A float always results in a float, independent of the type of the other operand. The rest looks like this:

Code: Select all

uint8  + int8   => int16
uint8  + int16  => int16
uint8  + uint16 => uint16
int8   + int16  => int16
int8   + uint16 => int32
uint16 + int16  => int32
I would like to propose that we work on the assumption that a microcontroller should process "small" numbers, and then substitute int32 with float. (A float is 32 bits wide, so we don't lose RAM by doing so.) That would bring the number of cases down to 25; if one keeps int32, we end up with 36. Frankly, I don't think that is worth the trouble and the almost 50% overhead. However, if people do insist, I can leave int32 in.

Cheers,

Zoltán

chuckbook
Posts: 135
Joined: Fri Oct 30, 2015 11:55 pm

Re: ulab, or what you will - numpy on bare metal

Post by chuckbook » Fri Sep 27, 2019 11:13 am

Very impressive! Thanks for sharing this.
1k FFT (SP) in ~0.8ms on PYBD, not bad.

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

FFT speed

Post by pythoncoder » Sat Sep 28, 2019 7:11 am

I'm astounded by these figures. I've just run my assembler solution on Pyboard D at 214MHz. A 1K DFT takes 3.5ms (on SF2W or SF6W). I never pretended my code was optimised, but the fact that a modern compiler running the same algorithm beats it by a factor of four is a revelation.

My library was written for instrumentation applications and it has some relevant features, namely integration with the Pyboard ADCs, window functions, forward and reverse transforms, and Cartesian and polar transforms with optional dB conversion.

But the core algorithm clearly needs rewriting in C: not only would it be faster, it would be portable.
Peter Hinch
Index to my micropython libraries.

chuckbook
Posts: 135
Joined: Fri Oct 30, 2015 11:55 pm

Re: ulab, or what you will - numpy on bare metal

Post by chuckbook » Sat Sep 28, 2019 4:05 pm

I'm also astounded by the capabilities of recent compilers. It's sometimes very hard to beat them with assembler.
Especially on the Cortex-M7 there are a lot of traps for us veteran programmers :-) However, a factor of four is quite something.
I definitely will have to do some tests.

Note:
After looking at the generated Cortex-M7 code for fft_kernel, the main reasons for the increased performance of ulab.fft are:
- avoid subroutine calls
- full float implementation
- full utilization of m7 instruction set

Note2:
All tests were done using -O2 optimization during MPY build.

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Re: ulab, or what you will - numpy on bare metal

Post by pythoncoder » Sun Sep 29, 2019 2:16 pm

Interesting. The inline assembler currently has the merit of convenience (no need to build the firmware). When dynamically loadable C modules land there may be little reason to use it.

The fact that the inline assembler only supports a subset of the Arm Thumb instruction set probably condemns it to being slower in most applications. Add to that the complexity of modern CPU design, and a good compiler will probably beat the efforts of most mere mortals.

In the days of the Intel 8086 I reckoned that the best assembler I could write would beat the best C by a factor of 2. Processors were simple and compilers were dumb.
Peter Hinch
Index to my micropython libraries.

v923z
Posts: 168
Joined: Mon Dec 28, 2015 6:19 pm

Re: FFT speed

Post by v923z » Wed Oct 02, 2019 1:35 pm

pythoncoder wrote:
Sat Sep 28, 2019 7:11 am
I'm astounded by these figures. I've just run my assembler solution on Pyboard D at 214MHz. A 1K DFT takes 3.5ms (on SF2W or SF6W). I never pretended my code was optimised, but the fact that a modern compiler running the same algorithm beats it by a factor of four is a revelation.
Peter, @chuckbook,

I think there is a misunderstanding stemming from this post:
chuckbook wrote:
Fri Sep 27, 2019 11:13 am
Very impressive! Thanks for sharing this. 1k FFT (SP) in ~0.8ms on PYBD, not bad.
In the original post I quoted a measurement of 1.948 ms; that is, I claimed that the FFT could be done in less than 2 ms, not 0.8 ms. So it is only a factor of two in speed, and basically no overhead in RAM, because, with the exception of a handful of temporary variables, the transform happens in place. It is also true that I did not overclock the CPU, so if one is to be fair, the gain is a bit more: extrapolating your numbers to a CPU clocked at 168 MHz, the FFT in assembly would cost around 4.3 ms.
pythoncoder wrote:
Sat Sep 28, 2019 7:11 am
But the core algorithm clearly needs rewriting in C: not only would it be faster, it would be portable.
If I remember correctly, your implementation was truly recursive (i.e., the original Cooley-Tukey), and as @chuckbook pointed out, the recursion in fft_kernel is removed by reordering the elements (bit reversal). I bet that if you implemented that in assembly, your numbers would improve considerably; I don't know how expensive it is to dispatch the sub-routines. On the other hand, the assembly code worked with 16-bit integers, didn't it? Which in itself is not necessarily a drawback, given that the results of the ADC are also integers.

Portability is another issue, and there I see a clear advantage for C. Or perhaps it is not a question of C vs. assembly per se, but of the more obvious interface; that, however, could probably be changed in the Python code, too.

Zoltán

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Re: FFT speed

Post by pythoncoder » Thu Oct 03, 2019 7:07 am

v923z wrote:
Wed Oct 02, 2019 1:35 pm
On the other hand, the assembly code worked with 16-bit integers, didn't it? Which in itself is not necessarily a drawback, given that the results of the ADC are also integers.
That was a very old version. It used 32 bit integers, and dated to a time when the inline assembler didn't support floating point. Integer DFT is not ideal: integer processing introduces cumulative errors especially on larger transforms. When we extended the assembler support to provide FP I rewrote the DFT to use it. Timings are based on that version.

It does indeed use the recursive algorithm and I made no effort at optimisation: I just wanted something very much faster than Python ;)

Thank you for the clarification re timings, which now make a lot more sense to me ;)
Peter Hinch
Index to my micropython libraries.

v923z
Posts: 168
Joined: Mon Dec 28, 2015 6:19 pm

Re: FFT speed

Post by v923z » Thu Oct 03, 2019 10:11 am

Peter,
pythoncoder wrote:
Thu Oct 03, 2019 7:07 am
v923z wrote:
Wed Oct 02, 2019 1:35 pm
On the other hand, the assembly code worked with 16-bit integers, didn't it? Which in itself is not necessarily a drawback, given that the results of the ADC are also integers.
That was a very old version. It used 32 bit integers, and dated to a time when the inline assembler didn't support floating point. Integer DFT is not ideal: integer processing introduces cumulative errors especially on larger transforms. When we extended the assembler support to provide FP I rewrote the DFT to use it. Timings are based on that version.
You have to update your readme: it still states 12 ms as the execution time of a 1024-point FFT.
It performs a 256 sample conversion in 2.5mS. This compares with 115mS for a floating point conversion in pure Python and 90mS using the native code emitter. A 1024 point conversion takes 12mS.
One more comment: originally I also had the recursive algorithm, but that cost around 20 ms with 1024 points, and it wasn't very economical with the RAM, either. True, I didn't pre-compute the twiddle factors, but still, it was a bit of a disappointment.

Cheers,
Zoltán

Post Reply