Page 1 of 3

[SOLVED] uasyncio is blocked for 1 hour, 11 minutes on ESP8266

Posted: Fri May 17, 2019 12:21 pm
by jomas
UPDATE may 22: Cause found. The blocking is caused by a bug in mp_hal_ticks_ms.

Under certain conditions uasyncio is blocked for exactly 1 hour, 11 minutes. This was first reported in topic viewtopic.php?f=16&t=4706. I have been running several esp8266 (ESP-12) units to try to reduce the code that shows this behavior. I came to this code:

Code: Select all

# Date: april 26, 2019
import utime
from machine import Pin, Timer
import uasyncio as asyncio

class MP_BUG:
    
    def __init__(self):
        Pin(12, Pin.OUT).value(0); Pin(13, Pin.OUT).value(0); Pin(15, Pin.OUT).value(0)
        self.buz = Pin(5, Pin.OUT) ; self.buz.value(0)
        self.led = Pin(2, Pin.OUT) ; self.led.value(1)
        self.tim = Timer(-1)
        self.tim.init(period=5000, callback=self.watchdog)
        self.second = self.watchdog_second = 0

    def watchdog(self, timer):
        print("Watchdog", self.second)
        if self.second == self.watchdog_second:
            self.buz.value(1)
        self.watchdog_second = self.second
        
    def now_sec(self):
        """Return time."""
        return utime.ticks_us()//1000000

    @asyncio.coroutine
    def time_loop(self):
        print("Start coroutine time_loop")
        while True:
            now = utime.localtime(self.now_sec())
            self.second = now[5]
            await asyncio.sleep_ms(10)

    @asyncio.coroutine
    def show(self):
        print("Start coroutine show")
        while True:
            while not self.second == 0:
                await asyncio.sleep_ms(100)
            # Blink 30 times
            for i in range(30):
                self.led.value(0)
                await asyncio.sleep_ms(4)
                self.led.value(1)
                await asyncio.sleep_ms(50)
            await asyncio.sleep(2)

    def run(self):
        loop = asyncio.get_event_loop()
        loop.create_task(self.show())
        loop.run_until_complete(self.time_loop())

stop = Pin(4, Pin.IN, Pin.PULL_UP)
utime.sleep(2)
if not stop.value() == 0:
    mpbug = MP_BUG()
    mpbug.run()
Two coroutines:
'time_loop', that update self.seconds.
'show' that waits until self.seconds is 0 and then blinks a led for 30 times.

The watchdog watch if the seconds get updated. If not then self.buz (Buzzer) will get high.

If this code is run (just put this as main.py on your esp8266.) then after a while (can be hours or days) the coroutines are blocked for exactly 1 hour, 11 minutes. After this time the coroutines and the program will continue as if nothing happened.
(btw. 1 hour, 11 minutes is 2^32 uSeconds, that is one cycle of the internal system_get_time() call of the Espressif SDK.)

The same (more or less) code without uasyncio never blocks and this unit is running continuously for months without problems.
I have tried the standard 1.10 version of MP and the version of pfalcon (pycopy, uasyncio V2) but they al have the same behaviour.

How can this bug reported? I don't have a github account.
Anyone has suggestions to avoid this blocking without skipping uasyncio?

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Posted: Fri May 17, 2019 5:45 pm
by kevinkk525
I'm glad you put that much effort into confirming the bug I encountered.
I can try to submit it to github. Maybe something will get found although apparently most people don't have the problem and are having trouble confirming it.

I accepted that I need a watchdog resetting the board if the freeze happens and that works fine but of course finding a bugfix would be preferrable.

Edit: I create an issue on github: https://github.com/micropython/micropython/issues/4795

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Posted: Sat May 18, 2019 11:06 am
by pythoncoder
You (both) might like to look at my comment in that issue which suggests ways to narrow down the cause.

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Posted: Sat May 18, 2019 10:22 pm
by jomas
Thanks kevinkk525 for reporting the problem.
@pythoncoder, I have added a coroutine with await asyncio.sleep(0) in my "test" program as you suggested.
It is running now. If the testprogram blocks I will report the results.

Added code:

Code: Select all

    @asyncio.coroutine
    def pythoncoder(self):
        print("Start coroutine pythoncoder")
        while True:
            self.pycod += 1
            if self.pycod > 200:
                self.pycod = 0
                self.redled.value(not self.redled.value())
            await asyncio.sleep_ms(0)

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Posted: Sun May 19, 2019 11:10 am
by jomas
uasyncio imports and uses utimeq.
If I look at the source of utimeq I see the following code:

Code: Select all

STATIC bool time_less_than(struct qentry *item, struct qentry *parent) {
    mp_uint_t item_tm = item->time;
    mp_uint_t parent_tm = parent->time;
    mp_uint_t res = parent_tm - item_tm;
    if (res == 0) {
        // TODO: This actually should use the same "ring" logic
        // as for time, to avoid artifacts when id's overflow.
        return item->id < parent->id;
    }
    if ((mp_int_t)res < 0) {
        res += MODULO;
    }
    return res && res < (MODULO / 2);
}
That "TODO", can this be the problem of the blocking?
because 'time_less_than' is used in 'heap_siftdown', heap_siftup and heappush.

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Posted: Sun May 19, 2019 12:41 pm
by pythoncoder
Interesting point. I haven't studied the utimeq C code in detail but that code looks like a "tie-breaker" to ensure a deterministic response if the two items being compared have exactly equal times. The routine returns the result of comparing their ID's. This seems a logical thing to do.

I thought that in MicroPython the ID corresponds to the address of the object. In which case the comment about overflows doesn't make much sense. So I may be misunderstanding this and need to dig a lot deeper unless you've got there first.

My puzzlement over this bug is the correspondence with a microsecond rollover when uasyncio uses a millisecond timer. This is what leads me to suspect a problem with the environment on the ESP8266 - possibly the vendor code - rather than with uasyncio itself. I don't know any place in uasyncio where microseconds are used. If the fault is in the ESP8266 environment it could be a glitch in the millisecond timer or something which is stopping the MicroPython VM. The formwer strikes me as most likely, not least because I think someone would by now have spotted the VM stalling for such a long period.

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Posted: Sun May 19, 2019 3:10 pm
by kevinkk525
I would also assume that it is something esp8266 specific. Either in micropython firmware or the esp-idf.
I'm only running an ESP32 occasionally so I don't have a good comparison but didn't encounter the problem there. Only devices anyone reported this problem yet is the esp8266. If the problem affected other devices too, we should have heard about it by now. That makes a problem with a universal micropython library improbable unless it relies on esp8266-specific libraries/esp-idf which could introduce undesired behaviour (like relying on esp-idf timing that might not be accurate?).

I will try to set up multiple esp8266 during the next weeks with different testing codes as suggested by pythoncoder. Maybe it's possible to narrow the problem down.

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Posted: Sun May 19, 2019 5:52 pm
by jomas
@pythoncoder, all time functions are more or less derived from the esp-sdk call 'system_get_time()' this is just a 32 bits counter that counts from 0 -> 2^32-1 and then rolls over to 0. I hardly can believe that this call would block. I have not found any reports of this.
Also my non-uasyncio version is running without resets or blocking for months now.

That the problem is related to esp8266 specific however is true in my opinion.
If I look at the specific time code for the esp8266 'esp_mphal.c' may have some clues.
For example if 'mp_hal_delay_us' is called with a small negative number (for example -4000) then this call will block for about 1 hour 11 minutes.
But as far as I see all calls to mp_hal_delay_us/ms are protected from calling with negative numbers. So I realy don't know where to look now.
I will just wait for my esp with your additional code to block.

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Posted: Mon May 20, 2019 5:44 am
by pythoncoder
jomas wrote:
Sun May 19, 2019 5:52 pm
@pythoncoder, all time functions are more or less derived from the esp-sdk call 'system_get_time()' this is just a 32 bits counter that counts from 0 -> 2^32-1 and then rolls over to 0. I hardly can believe that this call would block. I have not found any reports of this.
Also my non-uasyncio version is running without resets or blocking for months now.

That the problem is related to esp8266 specific however is true in my opinion...
The interface between uasyncio and the HAL is via utime.ticks_ms. Blocking is, as you say, improbable. But is it possible that, at a certain point in the cycle of the underlying 32 bit counter, the MicroPython ticks_ms implementation on ESP8266 is returning incorrect millisecond values? Values offset by this exact value? This would explain the behaviour - IF, during the outages, the zero delay test thread continues to run.

Note I have no knowledge of the HAL. The above is guesswork based on my understanding of uasyncio.

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Posted: Mon May 20, 2019 7:19 pm
by jomas
@pythoncoder, @kevinkk525,
I think I found the bug.

In ports/esp8266/ets_alt_task.c there is a function 'ets_loop_iter'. This function is called in the main loop/during idle I think.
This function implements a 64 bits counter that represents the system time in uSeconds. A snippet of the code is as follows.

Code: Select all

// to implement a 64-bit wide microsecond counter
static uint32_t system_time_prev = 0;
uint32_t system_time_high_word = 0;

bool ets_loop_iter(void) {
    if (ets_loop_iter_disable) {
        return false;
    }

    // handle overflow of system microsecond counter
    ets_intr_lock();
    uint32_t system_time_cur = system_get_time();
    if (system_time_cur < system_time_prev) {
        system_time_high_word += 1; // record overflow of low 32-bits
    }
    system_time_prev = system_time_cur;
    ets_intr_unlock();
As you can see the 64 bits counter is divided into two parts. system_get_time() and system_time_high_word.
Every roll-over of system_get_time() is counted in system_time_high_word.

Now time.ticks_ms() uses this counter through the HAL function mp_hal_ticks_ms() in ports/esp8266/esp_mphal.c.
It is implemented by this code:

Code: Select all

uint32_t mp_hal_ticks_ms(void) {
    return ((uint64_t)system_time_high_word << 32 | (uint64_t)system_get_time()) / 1000;
}
Now suppose you call time.ticks_ms() just before the roll-over of system_get_time()
Then if you call time.ticks_ms() again and the roll-over happens in this call just before evaluating system_get_time() in mp_hal_ticks_ms. Then you will have a gap of 2^32 uSeconds == 1 hour, 11 minutes because mp_hal_ticks_ms() uses the old time_high_word which is not incremented yet.

So 2 calls of time.ticks_ms() right after each other can have a time gap of 1hour, 11min.
As you said the timing of uasyncio depends on time.ticks_ms() so this is what is causing the blocking in my opinion.
Do you agree or am I missing something?