fastest way to fill large bytearrays? (Neopixel, APA102 et al)

hdsjulian · Post by **hdsjulian** » Wed Mar 06, 2019 4:22 pm

So I'm working on a project with 120 APA102 LEDs and it's insanely slow. The same goes for Neopixels. The problem isn't writing to the LEDs (this takes about 2 ms) but rather filling/overwriting the bytearray (which takes about 100 ms for the 120*4 values).
Any pointers on how I can speed this up?

hdsjulian · Post by **hdsjulian** » Wed Mar 06, 2019 5:03 pm

Update: this takes 42 ms to run. Is there any way to speed it up?

Code: Select all

import utime

ba = bytearray(120*4)

start = utime.ticks_us()
for i in range(120):
    ba[i*4+0] = 255
    ba[i*4+1] = 255
    ba[i*4+2] = 255
    ba[i*4+3] = 255
end = utime.ticks_us()
# ticks_diff() copes with wrap-around of the microsecond counter
print("Total time: "+str(utime.ticks_diff(end, start)))

Christian Walther · Post by **Christian Walther** » Wed Mar 06, 2019 6:59 pm

First, if this code wasn’t in a function, put it in one: local variables are faster than global ones.

This takes 9 ms on my ESP8266:

Code: Select all

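import utime

def run():
    # ba, start, end and i are all locals here, which is what makes this faster
    ba = bytearray(120*4)
    start = utime.ticks_us()
    for i in range(120):
        ba[i*4+0] = 255
        ba[i*4+1] = 255
        ba[i*4+2] = 255
        ba[i*4+3] = 255
    end = utime.ticks_us()
    # ticks_diff() copes with wrap-around of the microsecond counter
    print("Total time: "+str(utime.ticks_diff(end, start)))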

This is my first guess (get rid of those redundant multiplications) and takes 5.8 ms (edited: originally reported as 1.6 ms, with the loop running to 120 instead of 480):

Code: Select all

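import utime

def run():
    ba = bytearray(120*4)
    start = utime.ticks_us()
    i = 0
    # i counts bytes, so the bound is the buffer size (480), not the LED count
    while i < 480:
        ba[i] = 255
        i += 1
        ba[i] = 255
        i += 1
        ba[i] = 255
        i += 1
        ba[i] = 255
        i += 1
    end = utime.ticks_us()
    # ticks_diff() copes with wrap-around of the microsecond counter
    print("Total time: "+str(utime.ticks_diff(end, start)))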

For more tips, see Damien’s talk on Writing Fast and Efficient MicroPython (from the PyCon AU topic).

dhylands · Post by **dhylands** » Wed Mar 06, 2019 8:04 pm

I think your while loop should be while i < 480, since 480 is the size of the bytearray. You want 120 iterations of the while loop, but each iteration increments i by 4.
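The effect of the loop bound can be checked directly. The following is only a minimal sketch in plain Python; the helper name fill_unrolled and its limit parameter are made up for illustration and do not appear in the code above.

```python
def fill_unrolled(ba, limit):
    # Same unrolled loop as in the post above; `limit` is the bound under test.
    i = 0
    while i < limit:
        ba[i] = 255
        i += 1
        ba[i] = 255
        i += 1
        ba[i] = 255
        i += 1
        ba[i] = 255
        i += 1
    return ba


def count_filled(ba):
    # Count bytes that were actually set to 255.
    return sum(1 for b in ba if b == 255)


short = fill_unrolled(bytearray(120 * 4), 120)  # bound = number of LEDs
full = fill_unrolled(bytearray(120 * 4), 480)   # bound = number of bytes

print(count_filled(short))  # 120: only a quarter of the buffer is written
print(count_filled(full))   # 480: the whole buffer is written
```

With a bound of 120 the loop body runs only 30 times, which would also explain why that version looked faster than it should have.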

mattyt · Post by **mattyt** » Thu Mar 07, 2019 12:26 am

Also, if you are using APA102s you might want to take a look at my micropython-dotstar library. As discussed in this forum post, @bill-e was able to update 3000 APA102s in about a second using it. Hope that helps!

hdsjulian · Post by **hdsjulian** » Thu Mar 07, 2019 12:04 pm

mattyt wrote (Thu Mar 07, 2019 12:26 am):
"Also, if you are using APA102s you might want to take a look at my micropython-dotstar library. As discussed in this forum post, @bill-e was able to update 3000 APA102s in about a second using it. Hope that helps!"

Thanks, but the problem isn't updating the LEDs, it's writing the values into the bytearray.

Christian Walther · Post by **Christian Walther** » Thu Mar 07, 2019 6:16 pm

Doh! Thanks, Dave. That’s what I get for trying to solve technical questions with the mushy brain that comes from a cold. With the correct loop bounds, it takes 5.8 ms.

dhylands · Post by **dhylands** » Thu Mar 07, 2019 7:09 pm

I did the following tests on a pyboard 1.0:

Code: Select all

import utime
ba = bytearray(120*4)

start = utime.ticks_us()
for i in range(120):
    ba[i*4+0] = 255
    ba[i*4+1] = 255
    ba[i*4+2] = 255
    ba[i*4+3] = 255
end = utime.ticks_us()
print('Total time: {:4d} straight python'.format(end - start))

@micropython.native
def fill_native(ba):
    for i in range(len(ba)):
        ba[i] = 255

start = utime.ticks_us()
ba2 = bytearray(120*4)
fill_native(ba2)
end = utime.ticks_us()
print('Total time: {:4d} native emitter'.format(end - start))

@micropython.viper
def fill_viper(ba: ptr8, ba_len: int):
    for i in range(ba_len):
        ba[i] = 255

start = utime.ticks_us()
ba3 = bytearray(120*4)
fill_viper(ba3, len(ba3))
end = utime.ticks_us()
print('Total time: {:4d} viper emitter - 1 byte at a time'.format(end - start))

@micropython.viper
def fill_viper4(ba: ptr32, ba_len: int):
    for i in range(ba_len):
        ba[i] = -1

start = utime.ticks_us()
ba4 = bytearray(120*4)
fill_viper4(ba4, len(ba4)//4)
end = utime.ticks_us()
print('Total time: {:4d} viper emitter - 4 bytes at a time'.format(end - start))

@micropython.asm_thumb
def fill_asm(r0, r1): # buf(r0) len(r1)
    mov(r2, 0xff)
    add(r1, r1, r0)   # buf_end(r1) = len(r1) + buf(r0)
    label(loop)
    cmp(r0, r1)
    bge(endloop)      # branch if buf(r0) >= buf_end(r1)
    strb(r2, [r0, 0]) # *buf++ = 0xff
    add(r0, 1)
    b(loop)
    label(endloop)

start = utime.ticks_us()
ba5 = bytearray(120*4)
fill_asm(ba5, len(ba5))
end = utime.ticks_us()
print('Total time: {:4d} asm - 1 byte at a time'.format(end - start))

@micropython.asm_thumb
def fill_asm4(r0, r1): # buf(r0) len(r1)
    movw(r2, 0xffff)
    movt(r2, 0xffff)
    add(r1, r1, r0)   # buf_end(r1) = len(r1) + buf(r0)
    label(loop)
    cmp(r0, r1)
    bge(endloop)      # branch if buf(r0) >= buf_end(r1)
    str(r2, [r0, 0])  # *buf++ = 0xffffffff
    add(r0, 4)
    b(loop)
    label(endloop)

start = utime.ticks_us()
ba6 = bytearray(120*4)
fill_asm4(ba6, len(ba6))
end = utime.ticks_us()
print('Total time: {:4d} asm - 4 bytes at a time'.format(end - start))

and got these results:

Code: Select all

>>> import test
Total time: 4750 straight python
Total time: 1579 native emitter
Total time:  229 viper emitter - 1 byte at a time
Total time:  135 viper emitter - 4 bytes at a time
Total time:  121 asm - 1 byte at a time
Total time:   82 asm - 4 bytes at a time
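To put these figures in proportion, the speedup of each variant over the plain Python loop can be worked out from the numbers above. This is just arithmetic on the reported timings, not a new measurement; the variable names are illustrative.

```python
# Timings in microseconds, copied from the pyboard 1.0 results above.
timings_us = [
    ("straight python", 4750),
    ("native emitter", 1579),
    ("viper, 1 byte at a time", 229),
    ("viper, 4 bytes at a time", 135),
    ("asm, 1 byte at a time", 121),
    ("asm, 4 bytes at a time", 82),
]

baseline = timings_us[0][1]
# Speedup factor of each variant relative to the plain bytecode loop.
speedups = [(name, baseline / t) for name, t in timings_us]

for name, s in speedups:
    print("{:26s} {:5.1f}x".format(name, s))
```

That works out to roughly 3x for the native emitter, 21x and 35x for the two viper variants, and 39x and 58x for the two assembler variants.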

rcolistete · Post by **rcolistete** » Mon Mar 11, 2019 6:13 pm

On Pyboard D (SF2W), "MicroPython v1.9.4-925-g8edf1205f-dirty on 2019-01-16; PYBD_SF2 with STM32F722IEK".

Code from post #2: 4.0 ms (120 MHz) / 2.4 ms (216 MHz), compared with 42 ms (board not stated).

Code from post #3, first snippet: 2.5 ms (120 MHz) / 1.5 ms (216 MHz), compared with 9 ms (ESP8266).

dhylands's test at 120 MHz:

Code: Select all

>>> import test
Total time: 4289 straight python
Total time: 1287 native emitter
Total time:  144 viper emitter - 1 byte at a time
Total time:  126 viper emitter - 4 bytes at a time
Total time:   96 asm - 1 byte at a time
Total time:   84 asm - 4 bytes at a time

At 216 MHz:

Code: Select all

>>> machine.freq(216000000)
>>> import test
Total time: 2471 straight python
Total time:  739 native emitter
Total time:   99 viper emitter - 1 byte at a time
Total time:   79 viper emitter - 4 bytes at a time
Total time:   61 asm - 1 byte at a time
Total time:   48 asm - 4 bytes at a time

So the Pyboard D is:
- a lot faster than the ESP8266;
- at 120 MHz, a little faster than the Pyboard v1.0/1.1 in almost all tests; at 216 MHz, a lot faster.

Incredible to have the native, viper and asm decorators already working on the Pyboard D, with high performance!
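One way to read the two Pyboard D runs is to compare how much each test gained against the clock ratio of 216/120 = 1.8. The sketch below only reuses the timings listed above; the names are illustrative.

```python
# (label, time at 120 MHz, time at 216 MHz) in microseconds, from the runs above
runs_us = [
    ("straight python", 4289, 2471),
    ("native emitter", 1287, 739),
    ("viper, 1 byte at a time", 144, 99),
    ("viper, 4 bytes at a time", 126, 79),
    ("asm, 1 byte at a time", 96, 61),
    ("asm, 4 bytes at a time", 84, 48),
]

clock_ratio = 216 / 120  # how much faster the core clock is
# How much faster each test actually ran at the higher clock.
ratios = [(name, slow / fast) for name, slow, fast in runs_us]

for name, r in ratios:
    print("{:26s} {:.2f}x (clock ratio {:.2f}x)".format(name, r, clock_ratio))
```

The bytecode and native rows come out near 1.74, close to the clock ratio, while the viper and asm rows range from about 1.45 to 1.75.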

pythoncoder · Post by **pythoncoder** » Wed Mar 13, 2019 9:34 am

The other take-away from these tests is the performance of Viper. A penalty relative to assembler of less than a factor of 2 (roughly 1.5 to 1.9 across the figures above) is astounding. Perhaps the figures for the faster options are dominated by the fixed overhead of a function call.
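The fixed-overhead guess could be tested by timing the same fill at two buffer sizes and fitting a straight line through the two points. What follows is a rough sketch under stated assumptions: the helper names (fill, time_fill, estimate_overhead) and the sizes 480 and 4800 are arbitrary choices for illustration, the fill shown is a plain bytecode loop that would need to be swapped for the native, viper or asm version being studied, and the result is only as good as the timer resolution. The try/except fallback exists only so the sketch also runs on desktop Python.

```python
try:
    # MicroPython
    from utime import ticks_us, ticks_diff
except ImportError:
    # Desktop Python fallback (assumption: only for trying the sketch out)
    import time

    def ticks_us():
        return time.perf_counter_ns() // 1000

    def ticks_diff(end, start):
        return end - start


def fill(ba):
    # Plain bytecode fill; replace with the variant under test.
    for i in range(len(ba)):
        ba[i] = 255


def time_fill(n, repeats=20):
    # Best-of-N time in microseconds to fill n bytes, call overhead included.
    ba = bytearray(n)
    best = None
    for _ in range(repeats):
        start = ticks_us()
        fill(ba)
        elapsed = ticks_diff(ticks_us(), start)
        if best is None or elapsed < best:
            best = elapsed
    return best


def estimate_overhead(n_small=480, n_large=4800):
    # Two-point fit of t = overhead + per_byte * n.
    t_small = time_fill(n_small)
    t_large = time_fill(n_large)
    per_byte = (t_large - t_small) / (n_large - n_small)
    overhead = t_small - per_byte * n_small
    return overhead, per_byte


overhead_us, per_byte_us = estimate_overhead()
print("fixed overhead ~ {:.1f} us, per byte ~ {:.3f} us".format(overhead_us, per_byte_us))
```

If the fixed part turned out to be a large share of the 480-byte time for the viper and asm versions, that would support the explanation above.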

MicroPython Forum (Archive)
