Fast/efficient alternative to bytearray.index()?

martincho · Post by **martincho** » Thu Jun 09, 2022 6:24 pm

Couldn't leave it alone...
This version is about 12% faster.

I can't think of a way to optimize the routine further than this. Pre-computing the function's address before running it in the loop makes each search 1 microsecond faster. That's about it, I think. Is there a faster way to pass the bytearray address, length and value?

Code: Select all

# Assembler...take 2
@micropython.asm_thumb
def look_for6(r0, r1, r2) -> int:
    # r0 bytearray address
    # r1 bytearray length
    # r2 value to search for
    add(r3, r0, r1)     # Pre-calculate last address
    mov(r4, r0)         # Working pointer starts at first byte
    label(SEARCH)       # Start of search loop
    ldrb(r1, [r4, 0])   # Get next value from memory
    cmp(r2, r1)         # Match?
    beq(DONE)           # Yes. Done.
    add(r4, 1)          # No. Increment pointer
    cmp(r4, r3)         # If we are not past the end of the bytearray, keep going
    ble(SEARCH)         #
    mov(r0, 0)          # Not found. Return -1
    sub(r0, 1)          #
    b(END)              #
    label(DONE)         # Return the index where we got a match
    sub(r0, r4, r0)     #
    label(END)          # That's all!

Code: Select all

Worst case search target, 1000 iterations
Time 0:      231.0 ms
Time 1:      106.0 ms
Time 2:       92.0 ms
Time 3:       91.0 ms
Time 4:       91.0 ms
Time 2a:      90.0 ms
Time 5:       32.0 ms
Time 6:       28.0 ms

Random search target, 1000 iterations
Time 0:      223.0 ms
Time 1:       98.0 ms
Time 2:       85.0 ms
Time 3:       85.0 ms
Time 4:       85.0 ms
Time 2a:      84.0 ms
Time 5:       30.0 ms
Time 6:       26.0 ms

martincho · Post by **martincho** » Fri Jun 10, 2022 5:39 am

Just for the fun of it. Here's another version.

This one lets you search within a subsection of the bytearray(). This is useful if this is something like a data or communications buffer that you add packets of data to regularly and need to search for specific values. Instead of searching the entire bytearray every single time you can keep track of the location of the new packet and just search that section, which will be massively faster.

Code: Select all

# Fast search for a value in a bytearray()
# MicroPython does not currently support .index() on bytearrays..
# This version allows searching a subsection of the bytearray by using start/end indeces
# r0 bytearray address
# r1 start_index
# r2 end_index
# r3 value to search for
#
# Returns index if found or -1 if not
#
# Usage:
# ba = bytearray(range[n])
# index_of(ba, start_index, end_index, number to search for)
#
@micropython.asm_thumb
def index_of(r0, r1, r2, r3) -> int:
    cmp(r1, r2)         # If they are equal, return not found
    beq(NOT_FOUND)      #
    add(r1, r0, r1)     # Pointer to head
    add(r2, r0, r2)     # Pointer to tail
    label(SEARCH)       # Start of search loop
    ldrb(r5, [r1, 0])   # Get next value from memory
    cmp(r3, r5)         # Match?
    beq(DONE)           # Yes. Done.
    add(r1, 1)          # No. Increment pointer
    cmp(r1, r2)         # If we are not past the end of the bytearray, keep going
    bne(SEARCH)         #
    label(NOT_FOUND)    # Not found. Return -1
    mov(r0, 0)          #
    sub(r0, 1)          #
    b(END)              #
    label(DONE)         # Return the index where we got a match
    sub(r0, r1, r0)     #
    label(END)          # That's all!

jim · Post by **jim** » Fri Jun 10, 2022 6:43 pm

martincho wrote: ↑
Thu Jun 09, 2022 5:07 pm
Anyone know how to get PyCharm to understand MicroPython assembler and not torture me with this view?

Code works fine, it's just PyCharm that isn't happy at all.

Add the comment

Code: Select all

# type: ignore

near the top of the file. Should work with most editors, though I haven't used PyCharm.

rkompass · Post by **rkompass** » Fri Jun 10, 2022 7:24 pm

Hello martincho,

thank you. Nice stuff, I enjoyed reading the code and learnt a bit too. The last asm_thumb code appears quite optimal to me.
The timing of one search should be in us, because you have 1000 iterations. So we have 27 us. That is not only the assembler but also the
iteration in uPy.
With

Code: Select all


@micropython.viper
def time1000_lf6(ba:ptr8, size:int, target:int):
    start = time.ticks_ms()
    for test in range(1000):
        look_for6(ba, size, target)
    end = time.ticks_ms()
    return time.ticks_diff(end, start)

tdiff = time1000_lf6(ba, len(ba), 255)
print(f"Time 6: {tdiff:10.1f} us")

I get 21 us.
Some years ago there was a discussion here, that any function call, even asm_thumb or viper takes a few (iirc 4 on pyboard) microseconds.
Assuming that value we have 17 us for the asm loop. Dividing by 256, one iteration then should be about 66 ns.
Assuming 128 MHz clock of the RPi Pico the 6 asm instructions in your code should take 48 ns.
I don`t know where the last difference comes from, but anyway, the result seems quite optimal.

Greetings,
Raul

martincho · Post by **martincho** » Fri Jun 10, 2022 8:26 pm

Thanks Raul. It's been fun. I haven't touched assembler in probably 15 to 20 years. Last I did was a project using an 8-bit 8051F020 processor. It was a very substantial bit of software. Six months later I took the pain and ported it over to C. It was just impossible to maintain and extend. Love assembler for performance, but have managed to avoid it for a while. I am working on a device that is running multiple simultaneous channels of communications, so every ns counts. Truthfully, I should have done this in C. I am now in too deep to move away from MicroPython. It would have been a LOT easier.

Regarding timing. I got lazy and just displayed time for all 1000 iterations. You are right, the time for a single run is in the order of 1000x less than that, give or take contact with reality here and there.

One of the things that is truly affecting achievable performance is the garbage collector, one of the things you don't have to deal with in C. You just don't want random garbage collection in the middle of handling communications. Even with buffers things can step over each other quickly. By coding a bunch of the array handling into assembler with pre-allocated bytearrays() I can avoid a bunch of that.

The next step is to turn off garbage collection and integrate manual control into the application's state machine. From a cursory look, it seems that garbage collection takes about 5 microseconds. That's a long time.

pythoncoder · Post by **pythoncoder** » Sat Jun 11, 2022 7:29 am

martincho wrote: ↑
Fri Jun 10, 2022 8:26 pm
...
From a cursory look, it seems that garbage collection takes about 5 microseconds. That's a long time.

I'm afraid it's much worse than that. 1ms is about the minimum on Pyboards. With care, by doing a collect regularly when a program is not busy, it can be kept in the very low ms.

On platforms like ESP32 with SPIRAM the minimum is 100ms

martincho · Post by **martincho** » Sat Jun 11, 2022 7:50 am

This is on an RP2040. I am seeing about 5 ms fairly consistently when I call gc.collect() myself. This has been true with both very simple test code and in the context of a far more complex application with lots of data moving around.

The silly test:

Code: Select all

    while True:
         a = bytearray(range(200))
         b = bytearray(range(200))
         c = bytearray(range(200))
         d = bytearray(range(200))
         e = bytearray(range(200))
         f = bytearray(range(200))
         del(a)
         del(b)
         del(c)
         del(d)
         del(e)
         del(f)
         a0.value(1)
         gc.collect()
         a0.value(0)

a0 is an output pin connected to the logic analyzer.

Pulse measurement:

rkompass · Post by **rkompass** » Sat Jun 11, 2022 7:31 pm

Hello Martincho and Peter,

the last time I got somewhat deeper involved with assembler was on a Z-80 self-built machine, which 2 KB Rom, 2 KB Ram and 1 KB of text-based TV output. I managed to program a sieve of Eratosthenes and was impressed how fast (quite immediately) the primes appeared on the screen. Then I was drawn to army (at age 18). I had engineered a power circuit for the CMOS Ram that held this program probably the next 15 years from 2 battery cells, until I threw the whole thing into the trash bin.
Now I regret that sometimes. Would have been nice to show it to my son.
This was in East Germany at a time when there was no computer available for private purchase.
I have no proof now that I did this, just the memory.
I learnt C on a ZX Spectrum two years later, still in East Germany, and forgot about assembler.
But in my memory the Z-80 was a perfect CPU and I regretted very much that the 8086 won the race with the IBM-PC.
Only now I learnt about the political background for this via Youtube.
So much for memories..

I looked up the clock cycles for the asm_thumb loop at https://developer.arm.com/documentation ... b/CHDDIGAC:

Code: Select all

    instruction       clock cycles
    -------------------------------
    label(SEARCH)
    ldrb(r1, [r4, 0])     2
    cmp(r2, r1)           1
    beq(DONE)             1 + 0    (no branching)           
    add(r4, 1)            1      
    cmp(r4, r3)           1    
    ble(SEARCH)           1 + 1    (branching)

So we have 8 clock cycles of 8 ns => 64 ns cycle duration, matching nicely the measured 66 ns.

By the way, you can set the RPi Pico to 250 Mhz. Then the full search through 256 bytes takes only 11 us.

Regarding garbage collection, what about a GC that does it incrementally with defined maximum blocking times? Is this theoretically possible? Has it been discussed already?

martincho · Post by **martincho** » Sun Jun 12, 2022 10:20 pm

rkompass wrote: ↑
Sat Jun 11, 2022 7:31 pm
Hello Martincho and Peter,

the last time I got somewhat deeper involved with assembler was on a Z-80 self-built machine, which 2 KB Rom, 2 KB Ram and 1 KB of text-based TV output. I managed to program a sieve of Eratosthenes and was impressed how fast (quite immediately) the primes appeared on the screen. Then I was drawn to army (at age 18). I had engineered a power circuit for the CMOS Ram that held this program probably the next 15 years from 2 battery cells, until I threw the whole thing into the trash bin.
Now I regret that sometimes. Would have been nice to show it to my son.
This was in East Germany at a time when there was no computer available for private purchase.
I have no proof now that I did this, just the memory.
I learnt C on a ZX Spectrum two years later, still in East Germany, and forgot about assembler.
But in my memory the Z-80 was a perfect CPU and I regretted very much that the 8086 won the race with the IBM-PC.
Only now I learnt about the political background for this via Youtube.
So much for memories..

I looked up the clock cycles for the asm_thumb loop at https://developer.arm.com/documentation ... b/CHDDIGAC:
Code: Select all
    instruction       clock cycles
    -------------------------------
    label(SEARCH)
    ldrb(r1, [r4, 0])     2
    cmp(r2, r1)           1
    beq(DONE)             1 + 0    (no branching)           
    add(r4, 1)            1      
    cmp(r4, r3)           1    
    ble(SEARCH)           1 + 1    (branching)
So we have 8 clock cycles of 8 ns => 64 ns cycle duration, matching nicely the measured 66 ns.

By the way, you can set the RPi Pico to 250 Mhz. Then the full search through 256 bytes takes only 11 us.

Regarding garbage collection, what about a GC that does it incrementally with defined maximum blocking times? Is this theoretically possible? Has it been discussed already?

I appreciate your comment (and the cycle counting!). My family lived in South America when I was young and just getting started. It wasn't East Germany, but when something like an 8080 processor (not a working computers, just the chip) costs a ridiculous amount of money you learn to get creative. My first processor was a 4 bit processor I made myself out of TTL (don't remember if it was LS) chips. I created the instruction set using a hard-wired discrete diode ROM for the instruction decoder. Fabulous learning experience. From there I think I went to a 6502, where I implemented Forth, designed and built a floppy drive controller (back then you had to design the analog front end as well!) and then wrote my own code editor to be able to work on a (real) VT-100 terminal and edit some code.

Today's programmers don't know what they missed. I absolutely loved the journey. Understanding things at such low levels is hugely important and valuable. I have not touched assembly language for decades. As soon as my MicroPython project ran into execution time issues and I discovered I could resort to assembler, it all came back. I just had to start learning ARM assembler, which isn't all that difficult. Once you understand how to build programs in assembler, you don't really forget. It's like riding a bike and 32 bit registers are such a luxury!

MicroPython Forum (Archive)

Fast/efficient alternative to bytearray.index()?

Re: Fast/efficient alternative to bytearray.index()?

Re: Fast/efficient alternative to bytearray.index()?

Re: Fast/efficient alternative to bytearray.index()?

Re: Fast/efficient alternative to bytearray.index()?

Re: Fast/efficient alternative to bytearray.index()?

Re: Fast/efficient alternative to bytearray.index()?

Re: Fast/efficient alternative to bytearray.index()?

Re: Fast/efficient alternative to bytearray.index()?

Re: Fast/efficient alternative to bytearray.index()?