Writing fast and efficient MicroPython (extension)

General discussions and questions about development of code with MicroPython that is not hardware specific.
Target audience: MicroPython Users.
ChrisO
Posts: 15
Joined: Mon Apr 06, 2020 6:16 pm

Writing fast and efficient MicroPython (extension)

Post by ChrisO » Thu Nov 05, 2020 11:05 am

Hi All,

TL;DR. why is the code at the bottom so dramatically slow compared to Damien's?

context:
---
I've watched this very interesting talk by Damien George https://www.youtube.com/watch?v=hHec4qL ... el=PyConAU.

It shows an easy performance test for rapidly blinking an LED in MicroPython.
The initial code, which basically anyone would judge to be "OK", runs at around 50 kHz, which is slow.
With a few easy tricks this can be increased to about 150 kHz (see the result of iteration 3 at timestamp 20:22).
---
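The "easy tricks" from the talk largely amount to hoisting global and attribute lookups into locals outside the hot loop. A rough off-device sketch of the idea; StubPin is a made-up stand-in for machine.Pin, and time.perf_counter replaces time.ticks_us, so this runs anywhere (the absolute numbers will of course differ from an ESP32):

```python
import time


class StubPin:
    """Stand-in for machine.Pin so the sketch runs off-device."""
    def __init__(self):
        self.state = 0

    def on(self):
        self.state = 1

    def off(self):
        self.state = 0


def blink_naive(pin, n):
    # Every iteration re-resolves pin.on / pin.off (attribute lookups).
    for _ in range(n):
        pin.on()
        pin.off()


def blink_fast(pin, n):
    # Hoist the bound methods into locals once, outside the loop.
    on = pin.on
    off = pin.off
    for _ in range(n):
        on()
        off()


pin = StubPin()
t0 = time.perf_counter()
blink_naive(pin, 100_000)
t1 = time.perf_counter()
blink_fast(pin, 100_000)
t2 = time.perf_counter()
print('naive: {:.4f}s  fast: {:.4f}s'.format(t1 - t0, t2 - t1))
```

On MicroPython the gap is much larger than on CPython, because global and attribute lookups are comparatively expensive there.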

I've recreated the tests to see how they perform on my esp32 and the results are very similar:

test1: 4.959 sec, 24.794 usec/blink : 40.33 kblinks/sec
test3: 0.923 sec, 4.614 usec/blink : 216.74 kblinks/sec

now I've got a real life use case for a controller which has to do three things simultaneously:
1. expose an HTTP api with endpoints to control/read several sensors.
2. run some logic which regulates a few outputs based on sensor readings.
3. relay info with MQTT to a remote server

Coming from a Java background, I've naturally spun up a webserver (picoweb) and created around 20 classes for various sensors plus an MQTT client. All of this works now and the initial POC is more or less complete... next up is improving upon it.

I wanted to learn about the performance implications uasyncio has on all of this, so I recreated Damien's test3, made it asynchronous, and popped in a single sleep_ms(0) to see what would happen. The results were dramatic:

test7: 116.883 sec, 584.416 usec/blink : 1.74 kblinks/sec

I did expect this to run slower, but not 127 times slower. I'd like to know what you all think about this.
Is a test like this relevant? Can we improve upon this further without resorting to a viper-like solution?

Code:

import time
from machine import Pin
import uasyncio


class SpecialPin:
    PIN = 2

    def __init__(self):
        self.led = Pin(self.PIN, Pin.OUT)
        self.on = self.led.on
        self.off = self.led.off


led = SpecialPin()
N = 200_000


async def blink_simple(n):
    on = led.on
    off = led.off
    r = range(n)
    sleep = uasyncio.sleep_ms
    for i in r:
        on()
        await sleep(0)
        off()


async def time_it(f, n):
    ticks_us = time.ticks_us
    t0 = ticks_us()
    await f(n)
    t1 = ticks_us()
    dt = time.ticks_diff(t1, t0)
    fmt = '{:5.3f} sec, {:6.3f} usec/blink : {:8.2f} kblinks/sec'
    print(fmt.format(dt * 1e-6, dt / n, n / dt * 1e3))  # use the parameter n, not the global N


uasyncio.run(time_it(blink_simple, N))
Chris

ChrisO
Posts: 15
Joined: Mon Apr 06, 2020 6:16 pm

Re: Writing fast and efficient MicroPython (extension)

Post by ChrisO » Thu Nov 05, 2020 11:35 am

Additional note:

I do not think it is the asynchronous nature of the method that is causing the issue.

If you adjust the method to look like the version below, the result is still good:

test7_single_sleep: 0.923 sec, 4.617 usec/blink : 216.59 kblinks/sec

Code:

async def blink_simple(n):
    on = led.on
    off = led.off
    r = range(n)
    sleep = uasyncio.sleep_ms
    await sleep(0)
    for i in r:
        on()
        off()
Chris

stijn
Posts: 546
Joined: Thu Apr 24, 2014 9:13 am

Re: Writing fast and efficient MicroPython (extension)

Post by stijn » Thu Nov 05, 2020 11:47 am

What do you expect sleep(0) to do? I.e. I don't know how it's implemented in MicroPython, but sleep(0) usually does not mean 'do not sleep' but rather 'yield', and that can take any amount of time. In your case, apparently about half a millisecond. So if it's about efficiency you wouldn't normally use it.

ChrisO
Posts: 15
Joined: Mon Apr 06, 2020 6:16 pm

Re: Writing fast and efficient MicroPython (extension)

Post by ChrisO » Thu Nov 05, 2020 12:06 pm

stijn wrote:
Thu Nov 05, 2020 11:47 am
What do you expect sleep(0) to do? I.e. I don't know how it's implemented in MicroPython, but sleep(0) usually does not mean 'do not sleep' but rather 'yield', and that can take any amount of time. In your case, apparently about half a millisecond. So if it's about efficiency you wouldn't normally use it.
Hi Stijn,
Thanks for responding.

You're right that it does not make sense to randomly sleep/yield within a function when you're concerned about its efficiency. With sleep_ms(0) I'm trying to convey that this specific task does not necessarily have priority over any other scheduled task.
It is a way of saying: I'm yielding to the event loop so that it can decide what needs to happen next before I finish this function.

I understand it's odd to ask about the efficiency of a method while also saying it has no priority, but my thought was that it seems futile to think about or implement optimisations when the switching of tasks with uasyncio is such a huge bottleneck.

In this example, even though I yield, there's no other task scheduled, so I sort of expected it to be faster.
Chris

ChrisO
Posts: 15
Joined: Mon Apr 06, 2020 6:16 pm

Re: Writing fast and efficient MicroPython (extension)

Post by ChrisO » Thu Nov 05, 2020 12:25 pm

It might be possible that the test I have created is just so illogical that it distracts from my actual goal, which is to optimise my code.
What I am trying to do here, is to create a small example, learn about it and apply it to my project somehow.
Unfortunately I cannot share this right now because it's for a work related project.

In essence I have a main.py which creates two scheduled tasks:
picoweb's app.run
a custom class SensorServer with a run method

Code:

# inside SensorServer
    async def run(self):
        try:
            [device.run() for device in self.devices.values() if isinstance(device, Sensor)]
            asyncio.get_event_loop().run_forever()
        except KeyboardInterrupt:
            [device.stop() for device in self.devices.values() if isinstance(device, Sensor)]
Each of those device.run() calls looks something like this:

Code:

    def run(self):
        self.running = True

        loop = asyncio.get_event_loop()
        loop.create_task(self._keep_running())

        self.logger.info('Started {}'.format(self.name))
The _keep_running is just a while loop with a sleep(X) inside, where X depends on how often we want that loop to run.
E.g. one device could be a 1-Wire temperature sensor which we would like to measure every 5 seconds;
another device could be a door sensor which we want to check every 2 seconds.
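For reference, a per-device polling loop of that shape might look like the following sketch, using CPython's asyncio (which uasyncio mirrors). poll_loop, the shortened intervals, and the stop_after bound are illustrative inventions for the demo, not the actual project code:

```python
import asyncio


async def poll_loop(name, interval_s, read, stop_after=3):
    # Hypothetical polling loop: read the sensor, then sleep for that
    # device's configured interval; stop_after just bounds the demo run.
    readings = []
    for _ in range(stop_after):
        readings.append(read())
        await asyncio.sleep(interval_s)
    return readings


async def main():
    # Two devices with independent poll intervals, scheduled cooperatively.
    temp = asyncio.create_task(poll_loop('temp', 0.05, lambda: 21.5))
    door = asyncio.create_task(poll_loop('door', 0.02, lambda: False))
    return await asyncio.gather(temp, door)


results = asyncio.run(main())
print(results)
```

Each device only pays for its own sleep; between sleeps, the event loop is free to run the webserver or the MQTT relay.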

It all seems to work, but it is rather sluggish, and with this project specifically I have no clue where to start optimising, or how to ask you the right questions within an appropriately small scope.
Chris

User avatar
jimmo
Posts: 2072
Joined: Tue Aug 08, 2017 1:57 am
Location: Sydney, Australia
Contact:

Re: Writing fast and efficient MicroPython (extension)

Post by jimmo » Fri Nov 06, 2020 2:37 am

ChrisO wrote:
Thu Nov 05, 2020 12:06 pm
With sleep_ms(0) I'm trying to convey that this specific task, does not necessarily have priority over any other scheduled task.
So it is a way of saying, I'm yielding to the control loop so that it can decide what needs to happen next before I finish this function.
Unfortunately sleep(0) has to do some work to decide "oh there's nothing else to do, keep going". It's not a huge amount of work, but relative to "nothing" it easily accounts for the performance hit you're seeing.
ChrisO wrote:
Thu Nov 05, 2020 11:35 am
r = range(n)
sleep = uasyncio.sleep_ms
await sleep(0)
for i in r:
FWIW, it's been a while since I saw the talk you're referencing, so I can't remember if this was mentioned, but there's a very specific caveat to "looking up globals is slow": "for i in range(a, b)" is specifically handled by the MicroPython compiler, which optimises out the range() and turns the loop into the equivalent of "for (int i = a; i < b; ++i)" in C. So by splitting out the "r = range()" and then "for i in r:" you actually lose this optimisation.
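A minimal illustration of the two loop shapes being contrasted. Both compute the same result anywhere, but the speed difference only materialises on MicroPython, where the inline range() form compiles to a C-style counting loop while the pre-bound form iterates a real range object:

```python
def sum_inline(n):
    # On MicroPython, the compiler recognises this literal pattern and
    # emits a counting loop; no range object is created at runtime.
    total = 0
    for i in range(n):
        total += i
    return total


def sum_prebound(n):
    # Binding range(n) to a name first defeats that optimisation:
    # the loop now iterates an actual range object.
    total = 0
    r = range(n)
    for i in r:
        total += i
    return total


print(sum_inline(100), sum_prebound(100))  # → 4950 4950 (same result, different bytecode on MicroPython)
```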

User avatar
pythoncoder
Posts: 4805
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Re: Writing fast and efficient MicroPython (extension)

Post by pythoncoder » Fri Nov 06, 2020 8:13 am

As a general point in most practical uasyncio applications, when you issue sleep_ms(0) you will experience a pause of several ms. This is because other tasks will be scheduled in round-robin fashion. Further, the delay will typically be variable depending on what the other tasks do between yields. That is just the nature of cooperative scheduling.
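The point about variable round-robin delay can be illustrated off-device with CPython's asyncio, whose cooperative model uasyncio mirrors: a sleep(0) takes roughly as long as whatever the next runnable task does before it yields. A hedged sketch (busy_task and the ~10 ms work figure are invented for the demo):

```python
import asyncio
import time


async def busy_task(work_s):
    # Simulates a peer task that computes for work_s between yields.
    for _ in range(5):
        t0 = time.perf_counter()
        while time.perf_counter() - t0 < work_s:
            pass  # busy work between yields
        await asyncio.sleep(0)


async def measure_yield_latency():
    # Time a single sleep(0) while another task is runnable: the pause
    # is dominated by that task's work, not by the scheduler itself.
    other = asyncio.create_task(busy_task(0.01))
    t0 = time.perf_counter()
    await asyncio.sleep(0)
    latency = time.perf_counter() - t0
    await other
    return latency


latency = asyncio.run(measure_yield_latency())
print('sleep(0) latency with a busy peer: {:.4f} s'.format(latency))
```

With the busy peer doing ~10 ms of work per slice, the measured sleep(0) pause is around 10 ms, illustrating that the delay depends on what other tasks do between yields.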

If you really want to benchmark schedulers, I wrote this script, which currently shows ~220 μs latency on a Pyboard 1.0. What I find impressive is that it will run 1000 concurrent tasks. Not that I'd advocate anything like that number...

The takeaway from that benchmark is that the scheduler overhead is maybe two orders of magnitude less than the typical application delay. This is impressive in my view.
Peter Hinch

ChrisO
Posts: 15
Joined: Mon Apr 06, 2020 6:16 pm

Re: Writing fast and efficient MicroPython (extension)

Post by ChrisO » Fri Nov 06, 2020 8:39 am

jimmo wrote:
Fri Nov 06, 2020 2:37 am
ChrisO wrote:
Thu Nov 05, 2020 11:35 am
r = range(n)
sleep = uasyncio.sleep_ms
await sleep(0)
for i in r:
FWIW, it's been a while since I saw the talk you're referencing so I can't remember if this was mentioned, but it's worth noting that there's a very specific caveat to the "looking up globals is slow", which is that "for i in range(a, b)" is specifically handled by the MicroPython compiler to optimise out the range() and turn it into the equivalent of "for (int i = a; i < b; ++i)" in C. So by splitting out the "r = range()" and then "for i in r:" you actually lose this optimisation.
Damien explicitly glossed over it and invited the audience to ask about it if they'd like to know more.
Unfortunately nobody did.

I'm not on the latest commit and currently when I swap out

Code:

for i in r

(0.923 sec, 4.614 usec/blink : 216.74 kblinks/sec)
for something like

Code:

for i in range(0, n)

(1.157 sec, 5.786 usec/blink : 172.83 kblinks/sec)

I definitely lose some performance.

I'll try to update my project to work with a more recent version of micropython and come back on this later on.

thanks for the reply
Chris

ChrisO
Posts: 15
Joined: Mon Apr 06, 2020 6:16 pm

Re: Writing fast and efficient MicroPython (extension)

Post by ChrisO » Fri Nov 06, 2020 9:05 am

jimmo wrote:
Fri Nov 06, 2020 2:37 am
Unfortunately sleep(0) has to do some work to decide "oh there's nothing else to do, keep going". It's not a huge amount of work, but relative to "nothing" it easily accounts for the performance hit you're seeing.
pythoncoder wrote:
Fri Nov 06, 2020 8:13 am
As a general point in most practical uasyncio applications, when you issue sleep_ms(0) you will experience a pause of several ms. This is because other tasks will be scheduled in round-robin fashion. Further, the delay will typically be variable depending on what the other tasks do between yields. That is just the nature of cooperative scheduling.
I probably sprinkled it a bit too much then. Thank you both for the insight.
pythoncoder wrote:
Fri Nov 06, 2020 8:13 am
If you really want to benchmark schedulers I wrote this script which currently shows ~220μs latency on a Pyboard 1.0. What I find impressive is that it will run 1000 concurrent tasks. Not that I'd advocate anything like that number...

The takeaway from that benchmark is that the scheduler overhead is maybe two orders of magnitude less than the typical application delay. This is impressive in my view.
I don't really want to benchmark schedulers, but learn what impact they have and how to use them best.
Having said that, I played around with it for a few minutes anyway :)

I ran your script on an ESP32-WROOM-32 and got the following result:
original_test_scheduler.py wrote:
Testing 100 coros for 2secs
Testing 200 coros for 2secs
Testing 500 coros for 2secs
Testing 1000 coros for 2secs
Coros 100 Iterations/sec 2060 Duration 485us
Coros 200 Iterations/sec 2065 Duration 484us
Coros 500 Iterations/sec 2027 Duration 493us
Coros 1000 Iterations/sec 2099 Duration 476us
I then applied a few of those optimisations mentioned in the talk and got a 6% increase! :)
optimized_test_scheduler.py wrote:
Testing 100 coros for 2secs
Testing 200 coros for 2secs
Testing 500 coros for 2secs
Testing 1000 coros for 2secs
Coros 100 Iterations/sec 2187 Duration 457us
Coros 200 Iterations/sec 2174 Duration 459us
Coros 500 Iterations/sec 2120 Duration 471us
Coros 1000 Iterations/sec 2173 Duration 460us
I really did not change much, but the result was satisfying; see this gist:
https://gist.github.com/chrisovergaauw/ ... 131824ee70
Chris

User avatar
pythoncoder
Posts: 4805
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

The scheduler benchmark

Post by pythoncoder » Fri Nov 06, 2020 3:54 pm

Interesting. A bit of history...

I wrote the benchmark six years ago, when MicroPython was young. At the time uasyncio V1 was unusable, lacking any capability for millisecond-class scheduling. So I wrote my own scheduler; the benchmark was intended to test it.

My scheduler enabled me to pursue a number of projects. Because I knew nothing about optimising MicroPython code, it was inefficient, performing allocations with wild abandon. By contrast uasyncio V2, written by one of the maintainers, was streets ahead. I scrapped my scheduler and ported the benchmark.

I can see opportunities for improving the benchmark; the only downside is that it would render comparisons with old figures meaningless.
Peter Hinch
