How to combine bytearrays fast ?

General discussions and questions abound development of code with MicroPython that is not hardware specific.
Target audience: MicroPython Users.
fstengel
Posts: 55
Joined: Tue Apr 17, 2018 4:37 pm

Re: How to combine bytearrays fast ?

Post by fstengel » Fri Feb 14, 2020 8:36 am

I just tried using the viper emitter on my ESP32. It seems that range(240) is strongly optimized. The following three functions

Code: Select all

@micropython.viper
def merge3(a:ptr8, b:ptr8, c:ptr8, d:ptr8, result:ptr8)->ptr8:
    iteration_list = range(0,240)
    for i in iteration_list:
        j = int(i)
        result[j] = a[j] + b[j] + c[j] + d[j]
    return(result)

@micropython.viper
def merge4(a:ptr8, b:ptr8, c:ptr8, d:ptr8, result:ptr8)->ptr8:
    for i in range(240):
        j = int(i)
        result[j] = a[j] + b[j] + c[j] + d[j]
    return(result)

@micropython.viper
def merge5(a:ptr8, b:ptr8, c:ptr8, d:ptr8, result:ptr8)->ptr8:
    j = int(0)
    while j<240:
        result[j] = a[j] + b[j] + c[j] + d[j]
        j += 1
    return(result)
Run in roughly 890us, 195us and 190us. The time difference seems to be the time it takes to build the iterator and to use it. the direct for i in range(240): looks like it is a direct synonym of the while loop (i.e. using a counter, not an iterator)

User avatar
jimmo
Posts: 2754
Joined: Tue Aug 08, 2017 1:57 am
Location: Sydney, Australia
Contact:

Re: How to combine bytearrays fast ?

Post by jimmo » Fri Feb 14, 2020 10:19 am

fstengel wrote:
Fri Feb 14, 2020 8:36 am
It seems that range(240) is strongly optimized.
Yup. It's specifically the "for i in range():" that is optimized. Calling range on its own behaves like normal. This applies to all emitters (bytecode, native, viper). See compile_for_stmt_optimised_range in compile.c
OutoftheBOTS_ wrote:
Fri Feb 14, 2020 6:53 am
Correct if I am wrong but it is the range(240) that takes up all the time because it doesn't create a list that it iterates...If you create a local list then it should run faster but use more memory.
range() (in Python 3) doesn't create a list -- it's a generator. (Back in Python 2, this was the difference between range() and xrange() -- This is why everyone had it drilled into them to use "for x in xrange()" in Python 2).

In general (for a generator function that isn't "range"), the following two snippets are identical:

Code: Select all

g = generator()
for i in g:
  ...

Code: Select all

for i in generator():
  ...
No list is created in either case so the memory usage is the same.

But, see above, range() is special, so the former snippet is actually much _worse_ when using range().
OutoftheBOTS_ wrote:
Fri Feb 14, 2020 6:53 am
through but rather is a global function that has to be called each iteration of the for loop.
The function isn't called for each iteration of the loop, rather each iteration pulls the next value out of the iterator. You're right that calling a global function is expensive, but advancing the iterator isn't so bad. (And again, this doesn't apply for range() ).

User avatar
jimmo
Posts: 2754
Joined: Tue Aug 08, 2017 1:57 am
Location: Sydney, Australia
Contact:

Re: How to combine bytearrays fast ?

Post by jimmo » Fri Feb 14, 2020 10:33 am

If you're very curious, this is the optimisation in action:

Code: Select all

for i in range(10):
  print(i)
goes to

Code: Select all

00 LOAD_CONST_SMALL_INT 0
01 JUMP 27
04 DUP_TOP
05 STORE_NAME i
08 LOAD_NAME __repl_print__ (cache=0)
12 LOAD_NAME print (cache=0)
16 LOAD_NAME i (cache=0)
20 CALL_FUNCTION n=1 nkw=0
22 CALL_FUNCTION n=1 nkw=0
24 POP_TOP
25 LOAD_CONST_SMALL_INT 1
26 BINARY_OP 14 __iadd__
27 DUP_TOP
28 LOAD_CONST_SMALL_INT 10
29 BINARY_OP 0 __lt__
30 POP_JUMP_IF_TRUE 4
33 POP_TOP
34 LOAD_CONST_NONE
35 RETURN_VALUE

which is pretty close to what you get from

Code: Select all

i = 0
while i < 10:
  print(i)
  i += 1
which becomes

Code: Select all

00 LOAD_CONST_SMALL_INT 0
01 STORE_NAME i
04 JUMP 33
07 LOAD_NAME __repl_print__ (cache=0)
11 LOAD_NAME print (cache=0)
15 LOAD_NAME i (cache=0)
19 CALL_FUNCTION n=1 nkw=0
21 CALL_FUNCTION n=1 nkw=0
23 POP_TOP
24 LOAD_NAME i (cache=0)
28 LOAD_CONST_SMALL_INT 1
29 BINARY_OP 14 __iadd__
30 STORE_NAME i
33 LOAD_NAME i (cache=0)
37 LOAD_CONST_SMALL_INT 10
38 BINARY_OP 0 __lt__
39 POP_JUMP_IF_TRUE 7
42 LOAD_CONST_NONE
43 RETURN_VALUE

OutoftheBOTS_
Posts: 847
Joined: Mon Nov 20, 2017 10:18 am

Re: How to combine bytearrays fast ?

Post by OutoftheBOTS_ » Fri Feb 14, 2020 11:02 pm

jimmo wrote:
Fri Feb 14, 2020 10:33 am
If you're very curious, this is the optimisation in action:

Code: Select all

for i in range(10):
  print(i)
goes to

Code: Select all

00 LOAD_CONST_SMALL_INT 0
01 JUMP 27
04 DUP_TOP
05 STORE_NAME i
08 LOAD_NAME __repl_print__ (cache=0)
12 LOAD_NAME print (cache=0)
16 LOAD_NAME i (cache=0)
20 CALL_FUNCTION n=1 nkw=0
22 CALL_FUNCTION n=1 nkw=0
24 POP_TOP
25 LOAD_CONST_SMALL_INT 1
26 BINARY_OP 14 __iadd__
27 DUP_TOP
28 LOAD_CONST_SMALL_INT 10
29 BINARY_OP 0 __lt__
30 POP_JUMP_IF_TRUE 4
33 POP_TOP
34 LOAD_CONST_NONE
35 RETURN_VALUE

which is pretty close to what you get from

Code: Select all

i = 0
while i < 10:
  print(i)
  i += 1
which becomes

Code: Select all

00 LOAD_CONST_SMALL_INT 0
01 STORE_NAME i
04 JUMP 33
07 LOAD_NAME __repl_print__ (cache=0)
11 LOAD_NAME print (cache=0)
15 LOAD_NAME i (cache=0)
19 CALL_FUNCTION n=1 nkw=0
21 CALL_FUNCTION n=1 nkw=0
23 POP_TOP
24 LOAD_NAME i (cache=0)
28 LOAD_CONST_SMALL_INT 1
29 BINARY_OP 14 __iadd__
30 STORE_NAME i
33 LOAD_NAME i (cache=0)
37 LOAD_CONST_SMALL_INT 10
38 BINARY_OP 0 __lt__
39 POP_JUMP_IF_TRUE 7
42 LOAD_CONST_NONE
43 RETURN_VALUE
Thanks for your detailed explanation as this helps me understand whats happening much better. I didn't know that range() had this level of optimization which of course is very useful because of how often it is used in most code.

Both sets of compiled instructions you posted seem to loop over 14 lines to get the job done so I assume we would expect them to both take the same amount of time to complete??

User avatar
jimmo
Posts: 2754
Joined: Tue Aug 08, 2017 1:57 am
Location: Sydney, Australia
Contact:

Re: How to combine bytearrays fast ?

Post by jimmo » Sat Feb 15, 2020 12:59 am

OutoftheBOTS_ wrote:
Fri Feb 14, 2020 11:02 pm
Both sets of compiled instructions you posted seem to loop over 14 lines to get the job done so I assume we would expect them to both take the same amount of time to complete??
At a high level ... yes, opcode dispatch has enough overhead relative to the actual work done that this isn't a bad approximation for the simple opcodes like add and jump etc.

When you start getting into LOAD_ATTR and LOAD_SUBSCR and stuff there's potentially a lot of stuff happening behind the scenes.

Post Reply