optimizing uasyncio performance

All ESP32 boards running MicroPython.
Target audience: MicroPython users with an ESP32 board.
Mike Teachman
Posts: 45
Joined: Mon Jun 13, 2016 3:19 pm
Location: Victoria, BC, Canada

Re: optimizing uasyncio performance

Post by Mike Teachman » Mon Dec 10, 2018 4:59 pm

pythoncoder wrote:
Sat Dec 08, 2018 4:51 pm
I have doubts about uasyncio coping with i2s. It depends on how the i2s code is implemented. uasyncio can handle data streams on UARTs because the UART interface uses interrupts and buffering. So the buffer is filled below the radar of uasyncio, and so long as the latter is emptied quicker than the ISR fills it, all is well, even with multiple competing coros. The buffer size needs to be big enough to cope with the worst-case latency imposed by the other coros.

A few years ago someone tried to implement I2S on the Pyboard using DMA but never managed to get it working without glitches. How does the Loboris version work? Is the buffer size configurable?
Note: possible TL;DR situation below, but I2S with uasyncio is not a one-paragraph discussion.

In many cases it will be possible to have I2S work with uasyncio on the ESP32. I'll refer to the Machine.I2S implementation I recently finished on a fork of the Loboris port (Lobo fork with I2S; the C file is the Machine.I2S class). Note that the Loboris port itself does not yet have I2S support.

I2S in the ESP-IDF works with DMA buffering. During instantiation in Machine.I2S, the size and number of DMA buffers are defined. Once started, the ESP32 I2S implementation continually reads samples from an I2S device and fills DMA buffers, even when something like uasyncio is taking CPU time. There are two design constraints to making "gapless sample processing" work. They are best shown by example:

Let's say you want to process audio samples from the Adafruit MEMS Microphone at a CD-quality sampling rate of 44.1 kHz. You create a uasyncio task to call the Machine.I2S.readinto() method. In this example, readinto() requests 256 audio samples per call. The I2S microphone Adafruit uses in their breakout board encodes each audio sample as 8 bytes (L+R channels of 32 bits each, although only one channel carries data), so each call requests 2048 bytes (256 samples x 8 bytes/sample). On my Lobo build, each call to readinto() takes on average 410 us to read 2048 bytes into a bytearray() buffer. The time for the physical I2S port to read 256 samples from the microphone is 5.8 ms (256/44.1 kHz). The 5.8 ms and 410 us define Design Constraint #1: all other coroutines must yield to the I2S task, on average, within 5.8 ms - 410 us, or about 5.4 ms. Otherwise, in the long run, samples will get dropped and the processing of audio samples will not be gapless.
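The arithmetic behind Design Constraint #1 can be sketched in a few lines of Python. The function name and constants below are illustrative only and are not part of Machine.I2S; the 410 us figure is the measured average quoted above and is assumed to apply unchanged to this configuration.

```python
def yield_budget_ms(samples_per_read, sample_rate_hz, read_time_us):
    """Average time (ms) the other coroutines may use between two reads."""
    # Time the I2S hardware needs to produce one read's worth of samples.
    fill_time_ms = samples_per_read / sample_rate_hz * 1000
    # Subtract the time spent inside readinto() itself.
    return fill_time_ms - read_time_us / 1000


SAMPLES_PER_READ = 256
BYTES_PER_SAMPLE = 8   # two 32-bit channels, only one carries data
READ_TIME_US = 410     # measured average per readinto() call (from the post)

bytes_per_read = SAMPLES_PER_READ * BYTES_PER_SAMPLE
budget_44k = yield_budget_ms(SAMPLES_PER_READ, 44100, READ_TIME_US)

print("bytes per read:", bytes_per_read)
print("average yield budget at 44.1 kHz: %.2f ms" % budget_44k)
```

Changing the sample rate argument shows how quickly the budget grows at lower rates, provided the per-call read time stays roughly the same.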

Design Constraint #2 focuses on DMA buffering and is the maximum amount of time that all other Coroutines can run before samples get dropped. Let's say we allocate 128 DMA buffers @ 128 bytes-per-buffer in the Machine.I2S() instantiation call (pretty much the maximum allowed). That is 128 x 128 = 16,384 bytes of sample buffering. As mentioned, there are 8 bytes per sample, so the DMA buffers can hold 2048 samples. At 44.1kHz, that works out to 46.4 ms of buffer time (2048/44.1kHz).

This means that other Coroutines can run for 46.4 ms before yielding to the I2S coroutine. If they run longer the DMA buffer will get overrun and audio samples will be lost.
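Design Constraint #2 is the same kind of calculation applied to the total DMA buffer space. The helper below is a hypothetical sketch, not part of any I2S API; the second call uses a smaller 32-buffer configuration purely for comparison.

```python
def dma_buffer_time_ms(num_buffers, bytes_per_buffer, bytes_per_sample, sample_rate_hz):
    """Time (ms) of audio the DMA buffers can hold before they overrun."""
    total_bytes = num_buffers * bytes_per_buffer
    samples = total_bytes // bytes_per_sample
    return samples / sample_rate_hz * 1000


# 128 buffers x 128 bytes, 8 bytes per sample, 44.1 kHz (the case above).
max_config_ms = dma_buffer_time_ms(128, 128, 8, 44100)

# A smaller allocation of 32 buffers x 128 bytes at the same rate.
small_config_ms = dma_buffer_time_ms(32, 128, 8, 44100)

print("128 buffers: %.1f ms" % max_config_ms)
print(" 32 buffers: %.1f ms" % small_config_ms)
```

Whatever value this returns is the longest stretch the other coroutines can hold the CPU before samples are lost.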

In summary, uasyncio can work with I2S if the system is designed taking into account the two design constraints described above. I think the 44.1 kHz example will be difficult to realize with the existing uasyncio implementation. Having all other coros complete in an average of 5.4 ms appears difficult to realize. However, many applications don't need CD sample rates. For example, my application is analysis of traffic noise, which only needs a max freq of 5000 Hz (10k sample rate). With a 10k sample rate, design constraint #1 relaxes to 25.6 ms -- in my early tests this seems doable with uasyncio. But... still some coros to add, so this optimism might be a bit early.

My repo includes an example uPy file showing how audio samples are read from an I2S microphone, then written to a WAV file on an external SD Card. In this example 32 DMA buffers are used at 128 bytes/buffer: I2S example

What do you think? Based on your expert knowledge of uasyncio, is there any hope for me to keep going with uasyncio?

Post by Mike Teachman » Mon Dec 10, 2018 5:38 pm

kevinkk525 wrote:
Mon Dec 10, 2018 3:48 pm
might have something to do with time.ticks_ms() being a 64-bit integer on loboris port using heap. In a post on his forum he says that in the next update, the behaviour will change and time.ticks_ms() will only return a 64-bit integer if the value needs it, otherwise it'll return a small int that won't need heap space.
Thanks. I didn't know that. I can imagine that the uasyncio scheduler makes frequent calls to this function.

pythoncoder
Posts: 3161
Joined: Fri Jul 18, 2014 8:01 am
Location: UK

Post by pythoncoder » Tue Dec 11, 2018 5:30 am

@Mike Teachman Your two design constraints refer to the maximum (T1) and average (T2) time in which "other coros yield to the I2S task". Each task actually yields to the scheduler. So when the I2S task yields, every other task will get a chance to run. So T1 is the sum of the maximum time between yields of every other coro, and T2 is the sum of the average time between yields of every other coro.

It is possible to improve this using my fast_io fork of uasyncio. This enables I/O coros to be written in a way that ensures that, every time any coro yields, the scheduler checks for the ready status of the I/O coro. Then T1 is the maximum delay of any one coro, and T2 is the similar average. This is exactly the kind of use-case I had in mind when I produced the fork.
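As a rough model of the difference described here (a simplification for illustration, not code from uasyncio or the fast_io fork), the worst-case wait for the I/O coro is the sum of the other coros' longest runs under standard round-robin scheduling, and only the single longest run when the I/O coro is checked at every yield. The run times below are hypothetical.

```python
def worst_case_latency_ms(max_run_ms, fast_io=False):
    """Worst-case wait (ms) before the I/O coro is serviced again.

    max_run_ms: longest time between yields for each *other* coro.
    """
    if not max_run_ms:
        return 0
    # Round-robin: every other coro may run once before the I/O coro.
    # I/O checked on every yield: only one other coro can run in between.
    return max(max_run_ms) if fast_io else sum(max_run_ms)


other_coros_ms = [4, 12, 7]   # hypothetical per-coro maximum run times

standard = worst_case_latency_ms(other_coros_ms)
fast = worst_case_latency_ms(other_coros_ms, fast_io=True)

print("round-robin worst case:", standard, "ms")
print("I/O polled on every yield:", fast, "ms")
```

The same reasoning applies to the averages: replace the per-coro maxima with per-coro average run times.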

See my uasyncio repo for details.

It has to be said that the constraints are still quite demanding.
Peter Hinch

Post by Mike Teachman » Tue Dec 11, 2018 4:33 pm

pythoncoder wrote:
Tue Dec 11, 2018 5:30 am
It is possible to improve this using my fast_io fork of uasyncio. This enables I/O coros to be written in a way that ensures that, every time any coro yields, the scheduler checks for the ready status of the I/O coro. Then T1 is the maximum delay of any one coro, and T2 is the similar average. This is exactly the kind of use-case I had in mind when I produced the fork.
Thanks for reading through that long post and suggesting a possible solution. Giving the I2S task elevated priority in the uasyncio scheduler should help. I'll give it a try.

mattyt
Posts: 160
Joined: Mon Jan 23, 2017 6:39 am

Post by mattyt » Tue Dec 11, 2018 9:54 pm

I don't have anything further to add - but do please keep us posted as I'm really interested in how this progresses!

Post by Mike Teachman » Sat Dec 29, 2018 6:03 am

As expected, project time "yielded" to holiday time with family and friends ;) But, a few slices of time allowed me to experiment with the Fast IO implementation of uasyncio. https://github.com/peterhinch/micropyth ... ASTPOLL.md

A time-critical coroutine that handles microphone samples was promoted to the high-priority I/O queue offered by the Fast IO library. This had the desirable effect of reducing the average yield latency in the microphone coroutine loop by about 20%.

Next, I measured the maximum yield time for the microphone coroutine, over 4 hours runtime. Using a high-priority I/O queue I expected a reduction in yield time. The result: a 30% reduction in maximum yield time (from 110ms -> 77ms). Yay!

Overall: Fast IO brings easily observable performance benefits to my application.

Another thing I accidentally learned is the performance penalty when a garbage collection (GC) happens. The LoBo port has a default 512kB heap for a 4MB PSRAM ESP32 module. If the GC threshold is 0, a GC with a 512kB heap takes a whopping 98ms. Reducing the GC threshold to 300000 reduces the GC time to 52ms. Still a lot. A GC stops the microphone coroutine from processing audio samples. But, the DMA memory with I2S is adequate to buffer samples during a GC, resulting in no lost audio samples.
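A measurement like this can be made with a short timing helper. The sketch below is an assumption about how one might do it, not the code actually used: it relies on MicroPython's time.ticks_us() and gc.threshold() where they exist, and falls back to CPython equivalents so it can be tried on a PC. Whether the LoBo port exposes gc.threshold() with the same semantics as mainline is not confirmed here.

```python
import gc
import time


def time_gc_ms():
    """Run one full collection and return how long it took, in ms."""
    if hasattr(time, "ticks_us"):             # MicroPython
        start = time.ticks_us()
        gc.collect()
        return time.ticks_diff(time.ticks_us(), start) / 1000
    start = time.perf_counter()               # CPython fallback
    gc.collect()
    return (time.perf_counter() - start) * 1000


# Assumption: the port provides gc.threshold(); skipped where it does not.
if hasattr(gc, "threshold"):
    gc.threshold(300000)

elapsed_ms = time_gc_ms()
print("gc.collect() took", elapsed_ms, "ms")
```

Comparing the result against the DMA buffer time shows whether a collection can complete without losing samples.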

I did some apples-to-apples GC performance tests with the LoBo port vs mainline uPy. I measured mainline uPy to have approx 3x faster GC when the GC threshold is 0. I posted this issue on the LoBo port forum to see if there is a way to improve this. https://loboris.eu/forum/showthread.php?tid=844

Lastly, major kudos for two uPy leaders on the work to provide uasyncio in uPy:
@pfalcon for the core uasyncio library in micropython-lib
@pythoncoder for the uasyncio extensions, performance improvements, and especially for the excellence in documentation+tutorials

Post by pythoncoder » Sat Dec 29, 2018 6:25 am

Re GC: in my experience (mainly on Pyboard and ESP8266), GC time is radically reduced if you issue

gc.collect()
explicitly at regular intervals. Ideally you do this at times when your code can tolerate a pause of a few ms.
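One way to do this under uasyncio is a small coroutine that collects and then sleeps. The following is a minimal sketch with hypothetical names and an arbitrary interval; on real firmware the loop would normally run forever, and the bounded loop here only exists so the example terminates.

```python
import gc

try:
    import uasyncio as asyncio    # MicroPython
except ImportError:
    import asyncio                # CPython, for trying the sketch on a PC


async def gc_task(interval_s, runs):
    """Call gc.collect() every interval_s seconds, `runs` times."""
    done = 0
    for _ in range(runs):         # use `while True:` on a real device
        gc.collect()              # frequent collections keep each one short
        done += 1
        await asyncio.sleep(interval_s)
    return done


def run(coro):
    """Run a coroutine to completion on either asyncio flavour."""
    if hasattr(asyncio, "run"):
        return asyncio.run(coro)
    # Older uasyncio has no run(); it may also not return the result.
    return asyncio.get_event_loop().run_until_complete(coro)


collections = run(gc_task(0.01, 3))
print("explicit collections:", collections)
```

The interval is a trade-off: it should be short enough that garbage does not accumulate, and the collection should be scheduled where a pause of a few ms is acceptable.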

Good to hear of a successful application for the fast_io fork :D
Peter Hinch
