[SOLVED] uasyncio is blocked for 1 hour, 11 minutes on ESP8266

All ESP8266 boards running MicroPython.
Official boards are the Adafruit Huzzah and Feather boards.
Target audience: MicroPython users with an ESP8266 board.
kevinkk525
Posts: 969
Joined: Sat Feb 03, 2018 7:02 pm

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Post by kevinkk525 » Mon May 20, 2019 8:38 pm

That is an interesting find! If that's true, it could explain why only the esp8266 is affected as the esp32 seems to use "esp_timer_get_time()" instead in mphalport.
I checked the functions you mention and to me (I don't know the inner workings of micropython) it looks like it is indeed possible to catch a value from system_get_time() that should have an incremented system_time_high_word which hasn't happened at that point.

But I don't fully understand the implications yet. Even if you get 2 values that are 1h 11 minutes apart, that doesn't mean that everything will be suspended for that duration. The first uasyncio task may get a waiting time of 1h 11 minutes but when the next task is being scheduled 10 ms later, ets_loop_iter should have run and system_time_high_word incremented. Then everything should work again as expected and only the first task is being stuck.

ets_loop_iter is called every time machine_idle() in modmachine.c is being called as part of ets_event_poll() in esp_mphal.c:

Code: Select all

STATIC mp_obj_t machine_idle(void) {
    uint32_t t = mp_hal_ticks_cpu();
    asm("waiti 0");
    t = mp_hal_ticks_cpu() - t;
    ets_event_poll(); // handle any events after possibly a long wait (eg feed WDT)
return MP_OBJ_NEW_SMALL_INT(t);
And although I can't find in the repo where machine_idle() is called, I know from needed workarounds in old esp32 version that this function is called very often.
Kevin Köck
Micropython Smarthome Firmware (with Home-Assistant integration): https://github.com/kevinkk525/pysmartnode

kevinkk525
Posts: 969
Joined: Sat Feb 03, 2018 7:02 pm

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Post by kevinkk525 » Mon May 20, 2019 9:17 pm

I thought I understand the implications but had to edit my post because I had thought it wrong.

In uasyncio core.py in method run():

Code: Select all

[...]
            if not self.runq:
                delay = -1
                if self.waitq:
                    tnow = self.time()
                    t = self.waitq.peektime()
                    delay = time.ticks_diff(t, tnow)
                    if delay < 0:
                        delay = 0
            self.wait(delay)
If the time returned in "tnow = self.time()" is off because system_time_high_word hasn't been incremented yet, it results in a delay that is off by 1h 11min. But negative! Therefore it results actually in no delay at all.
If the offset happens during the scheduling of the uasyncio task then peektime() returns a delay of 1h 11min but only if this is the only coroutine. When more coroutines are scheduled, only this one will be delayed by 1h 11 min while the others should continue to run correctly.

In __init__.py wait is implemented like this:

Code: Select all

    def wait(self, delay):
        if DEBUG and __debug__:
            log.debug("poll.wait(%d)", delay)
        # We need one-shot behavior (second arg of 1 to .poll())
        res = self.poller.ipoll(delay, 1)
        # log.debug("poll result: %s", res)
        # Remove "if res" workaround after
        # https://github.com/micropython/micropython/issues/2716 fixed.
        if res:
            for sock, ev in res:
                cb = self.objmap[id(sock)]
                if ev & (select.POLLHUP | select.POLLERR):
                    # These events are returned even if not requested, and
                    # are sticky, i.e. will be returned again and again.
                    # If the caller doesn't do proper error handling and
                    # unregister this sock, we'll busy-loop on it, so we
                    # as well can unregister it now "just in case".
                    self.remove_reader(sock)
                if DEBUG and __debug__:
                    log.debug("Calling IO callback: %r", cb)
                if isinstance(cb, tuple):
                    cb[0](*cb[1])
                else:
                    cb.pend_throw(None)
                    self.call_soon(cb)
But there doesn't seem to be a way for getting a delay of 1h 11 minutes unless there is only one single coroutine that has been affected by the problem but this is not what we experience.

So I don't know :D
Kevin Köck
Micropython Smarthome Firmware (with Home-Assistant integration): https://github.com/kevinkk525/pysmartnode

jomas
Posts: 59
Joined: Mon Dec 25, 2017 1:48 pm
Location: Netherlands

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Post by jomas » Mon May 20, 2019 9:59 pm

I Think you are right that only one coroutine is blocking. (in fact it is not blocking it is just scheduled with a much to long delay)

You know why I concluded that the whole uasyncio was blocked?
Look at my test program (first post)
I had one coroutine 'time_loop' that updates self.second.
The other coroutine 'show' was to blink a led every minute to check if uasyncio was still working.
BUT 'show' waits for self.second to reach 0. BUT because 'time_loop' may have been given a too long delay because the bug, 'show' will wait 1 hour, 11, min also.
That made me make the wrong conclusion that asyncio was blocked completely.

kevinkk525
Posts: 969
Joined: Sat Feb 03, 2018 7:02 pm

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Post by kevinkk525 » Mon May 20, 2019 10:09 pm

No, your assumption is right. The whole uasyncio is blocking for 1h 11 minutes, not only one coroutine. I tested this when I first discovered the problem by running multiple testing codes (publishing sensor values and logging if there's a too big delay between executions in another coroutine) and all coroutines were delayed for 1h 11 minutes. So the assumption is correct. We either don't see the real problem yet or draw the wrong conclusions. Mabye what you discovered is the reason for all this and we are just missing something.
Kevin Köck
Micropython Smarthome Firmware (with Home-Assistant integration): https://github.com/kevinkk525/pysmartnode

User avatar
Roberthh
Posts: 3667
Joined: Sat May 09, 2015 4:13 pm
Location: Rhineland, Europe

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Post by Roberthh » Tue May 21, 2019 5:23 am

I had an event with @pythoncoder's simple test code this night. I used the settings for COUNT, MAX, MIN = 10000, 38, 34.
There was a fail with a time gap of 4295005 ms, which is 1h 11m 35s 5 ms, or about 2^32 µs. A single pass runs ~35ms, which is the deviation to 2^32.

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Post by pythoncoder » Tue May 21, 2019 5:39 am

jomas wrote:
Mon May 20, 2019 7:19 pm
...
Now time.ticks_ms() uses this counter through the HAL function mp_hal_ticks_ms() in ports/esp8266/esp_mphal.c.
It is implemented by this code:

Code: Select all

uint32_t mp_hal_ticks_ms(void) {
    return ((uint64_t)system_time_high_word << 32 | (uint64_t)system_get_time()) / 1000;
}
...
I agree with your comment on this code. However if this is the cause this would return a bad value for a very brief period and would surely result in only one coroutine having an incorrect time. This doesn't seem to accord with observed behaviour which seems to imply bad times being issued for a significant period (>1ms I'd have thought).

My C skills are decidedly rusty. But this function is declared as returning a 32 bit number (as I would expect). Yet it appears actually to return a 64 bit number. Does the compiler truncate it? If so, the high word will be discarded anyway. Why is it not written

Code: Select all

uint32_t mp_hal_ticks_ms(void) {
    return (uint32_t)(system_get_time()) / 1000);
}
At some point this must get converted to a MicroPython 30-bit small integer.

Of course uasyncio expects the ticks_ms() value to roll over. For the error to occur, and to be caused by timing, it would have to return a value offset by 1hr 11m for a period measured in ms. I don't think we've found a smoking gun.

I'm still looking forward to seeing if a zero delay coroutine is also stalled. This will tell us whether a time error is the underlying cause.
Peter Hinch
Index to my micropython libraries.

kevinkk525
Posts: 969
Joined: Sat Feb 03, 2018 7:02 pm

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Post by kevinkk525 » Tue May 21, 2019 6:46 am

Roberthh wrote:
Tue May 21, 2019 5:23 am
I had an event with @pythoncoder's simple test code this night. I used the settings for COUNT, MAX, MIN = 10000, 38, 34.
There was a fail with a time gap of 4295005 ms, which is 1h 11m 35s 5 ms, or about 2^32 µs. A single pass runs ~35ms, which is the deviation to 2^32.
Thank you @Roberthh for testing this!
This brings us on the right track. Apparently there is a case where "time.ticks_diff" returns 1h 11min.

The code in uasyncio responsible for the delay until the next coroutine is this:

Code: Select all

delay = 0
            if not self.runq:
                delay = -1
                if self.waitq:
                    tnow = self.time()
                    t = self.waitq.peektime()
                    delay = time.ticks_diff(t, tnow)
                    if delay < 0:
                        delay = 0
            self.wait(delay)
So if there is a case where time.ticks_diff() returns 1h 11 min, it blocks all executions for this duration (unless there were some pollers registered, which was never the case in my applications).

This means we now have to find out the reason why time.ticks_diff can return a 1h 11 min/2^32 offset.

The function in extmod/utime_mphal.c:

Code: Select all

STATIC mp_obj_t time_ticks_diff(mp_obj_t end_in, mp_obj_t start_in) {
    // we assume that the arguments come from ticks_xx so are small ints
    mp_uint_t start = MP_OBJ_SMALL_INT_VALUE(start_in);
    mp_uint_t end = MP_OBJ_SMALL_INT_VALUE(end_in);
    // Optimized formula avoiding if conditions. We adjust difference "forward",
    // wrap it around and adjust back.
    mp_int_t diff = ((end - start + MICROPY_PY_UTIME_TICKS_PERIOD / 2) & (MICROPY_PY_UTIME_TICKS_PERIOD - 1))
                   - MICROPY_PY_UTIME_TICKS_PERIOD / 2;
    return MP_OBJ_NEW_SMALL_INT(diff);
}
My C is a bit too rusty to fully understand what is done here and I don't know where MICROPY_PY_UTIME_TICKS_PERIOD is updated.
But considering that this is an extmod used by all ports and the problem being specific to the esp8266, I'd assume that his function works correctly.
This leads us back to to esp_mphal.c: If the high_word is indeed truncated by the compiler anyway, wouldn't that mean that the returned value would overflow and be back at 0 at one point? However ticks_diff() would then return a negative number and there would be no delay in uasyncio.
So from my knowledge without fully understanding ticks_diff I'm stuck here but at least we got prove that this is a problem within the timing modules on the esp8266.
Kevin Köck
Micropython Smarthome Firmware (with Home-Assistant integration): https://github.com/kevinkk525/pysmartnode

jomas
Posts: 59
Joined: Mon Dec 25, 2017 1:48 pm
Location: Netherlands

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Post by jomas » Tue May 21, 2019 8:36 am

@pythoncoder,
Your suggested mp_hal_ticks_ms will not work because system_get_time() will roll-over after 1:11H.
Your assumption that the compiler truncate is true. The result of the 64bit-counter/1000 will just return the lower 32bits as a unsigned int.

@kevinkk525,
1. MICROPY_PY_UTIME_TICKS_PERIOD-1 is just a #define in mpconfig.h that limits the counter value to a 30-bits counter. After that it is magically packed in a so called MP_OBJ_NEW_SMALL_INT.

2. calculating with unsigned int's will handle overflows correctly implicitly because they use unsigned integers.

for example take the function mp_hal_delay_us in esp_mphal.c,
it is codes like this:

Code: Select all

void mp_hal_delay_us(uint32_t us) {
    uint32_t start = system_get_time();
    while (system_get_time() - start < us) {
        ets_event_poll();
    }
}
Suppose you call it with us=100 and system_get_time() is just 50uS before roll-over.
You might think that this can break when system_get_time() rolls-over during execution but it does not. Because both start and system_get_time() are unsigned int's the result of 'system_get_time() - start' can never be negative.

kevinkk525
Posts: 969
Joined: Sat Feb 03, 2018 7:02 pm

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Post by kevinkk525 » Tue May 21, 2019 8:56 am

I tried to run some tests on my arduino:

Code: Select all

uint32_t system_time_high_word=0;
uint32_t current_time = 4294967295; //just before overflow
uint32_t MICROPY_PY_UTIME_TICKS_PERIOD =0; 

uint32_t system_get_time(void) {
  return current_time;
}

uint32_t mp_hal_ticks_ms(void) {
    return ((uint64_t)system_time_high_word << 32 | (uint64_t)system_get_time()) / 1000;
}

int32_t time_ticks_diff(uint32_t end_in, uint32_t start_in) {
    // we assume that the arguments come from ticks_xx so are small ints
    uint64_t start = uint64_t(start_in);
    uint64_t end = uint64_t(end_in);
    // Optimized formula avoiding if conditions. We adjust difference "forward",
    // wrap it around and adjust back.
    int64_t diff = ((end - start + MICROPY_PY_UTIME_TICKS_PERIOD / 2) & (MICROPY_PY_UTIME_TICKS_PERIOD - 1))
                   - MICROPY_PY_UTIME_TICKS_PERIOD / 2;
    return int32_t(diff);
}

uint32_t time_ticks_ms(void) {
  return uint32_t(mp_hal_ticks_ms() & (MICROPY_PY_UTIME_TICKS_PERIOD -1));
}

void setup() {
  // put your setup code here, to run once:
  Serial.begin(115200);
  uint32_t cur_time_hal = mp_hal_ticks_ms();
  uint32_t cur_time_py = time_ticks_ms();
  Serial.print("Current time hal: ");Serial.print(cur_time_hal);Serial.println("ms");
  Serial.print("Current time upy: ");Serial.print(cur_time_py);Serial.println("ms");
  Serial.print("Current time counter uint32_t: ");Serial.println(current_time);

  current_time++;
  uint32_t later_time_hal = mp_hal_ticks_ms();
  uint32_t later_time_py = time_ticks_ms();
  Serial.print("Later time hal with overflow in system_get_time without high word increment: ");Serial.print(later_time_hal);Serial.println("ms");
  Serial.print("Later time upy with overflow in system_get_time without high word increment: ");Serial.print(later_time_py);Serial.println("ms");
  Serial.print("time diff hal: ");Serial.print(later_time_hal-cur_time_hal);Serial.println("ms");
  Serial.print("time diff upy: ");Serial.print(time_ticks_diff(later_time_py, cur_time_py));Serial.println("ms");
  
  system_time_high_word++;
  later_time_hal = mp_hal_ticks_ms();
  later_time_py = time_ticks_ms();
  Serial.print("Later time hal with high word increment: ");Serial.print(later_time_hal);Serial.println("ms");
  Serial.print("Later time upy with high word increment: ");Serial.print(later_time_py);Serial.println("ms");
  Serial.print("time diff hal: ");Serial.print(later_time_hal-cur_time_hal);Serial.println("ms");
  Serial.print("time diff upy: ");Serial.print(time_ticks_diff(later_time_py, cur_time_py));Serial.println("ms");
}
Results in:

Code: Select all

10:40:08.395 -> Current time hal: 4294967ms
10:40:08.395 -> Current time upy: 4294967ms
10:40:08.395 -> Current time counter uint32_t: 4294967295
10:40:08.395 -> Later time hal with overflow in system_get_time without high word increment: 0ms
10:40:08.395 -> Later time upy with overflow in system_get_time without high word increment: 0ms
10:40:08.429 -> time diff hal: 4290672329ms
10:40:08.429 -> time diff upy: -4294967ms
10:40:08.429 -> Later time hal with high word increment: 4294967ms
10:40:08.429 -> Later time upy with high word increment: 4294967ms
10:40:08.429 -> time diff hal: 0ms
10:40:08.429 -> time diff upy: 0ms
So there is the possibility for the time.ticks_diff() to return an offset of roughly 2^32us when calling time.ticks_ms() before the system_time_high_word gets incremented but the diff would be negative. This wouldn't impact uasyncio. So I'm not quite sure where the error comes from.
I'm now running the test from pythoncoder slightly changed to see where things go wrong, what delta and at which ticks_ms.
Kevin Köck
Micropython Smarthome Firmware (with Home-Assistant integration): https://github.com/kevinkk525/pysmartnode

jomas
Posts: 59
Joined: Mon Dec 25, 2017 1:48 pm
Location: Netherlands

Re: uasyncio is blocked for 1 hour, 11 minutes (Bug?)

Post by jomas » Tue May 21, 2019 9:04 am

MICROPY_PY_UTIME_TICKS_PERIOD = 2^30.
So MICROPY_PY_UTIME_TICKS_PERIOD-1 is used as a mask of 30bits (0b111111111111111111111111111111)

Post Reply