[workaround used] esp8266 gets stuck periodically

All ESP8266 boards running MicroPython.
Official boards are the Adafruit Huzzah and Feather boards.
Target audience: MicroPython users with an ESP8266 board.
kevinkk525
Posts: 969
Joined: Sat Feb 03, 2018 7:02 pm

[workaround used] esp8266 gets stuck periodically

Post by kevinkk525 » Tue Apr 24, 2018 7:55 am

I'm running a smarthome project based on uasyncio which publishes sensor values using mqtt. It works fine for several days but sometimes gets stuck for 1h 10 minutes, then continues to work.
Within that time the WIFI connection is there but the mqtt library of Peter Hinch reconnects after that time.
I have a coroutine running that publishes a log message every 10 minutes and checks if really 10 minutes went past since the last execution and logs the difference. After the outage of 1h 10 minutes I get a log that 4311 seconds have past since the last execution, which is ~1h 10 minutes. That shows that something is completely blocking the microcontroller for that period without anything being executed during that time.

I'm using
- uasynico V2.0 (but that problem existed before too).
- MQTT-library of Peter Hinch
(- I2C with HTU21D (using I2C), publishing every 10 minutes), occurs even on devices without it
- gc.collect() every 10 minutes
- a few mqtt subscriptions

Maybe someone has an idea of what could be causing this kind of blocking. I would appreciate it.
I can't find the problem as basically everything is normal python.

Here are the coroutines running (except the ones from the mqtt library):

RAM:

Code: Select all

async def __gc(interval):
    while True:
        gc.collect()
        log.info(gc.mem_free(), local_only=True)
        await asyncio.sleep(interval)


async def __ram(topic, interval, interval_gc=10):
    asyncio.get_event_loop().create_task(__gc(interval_gc))
    await asyncio.sleep(12)
    while True:
        gc.collect()
        await mqtt.publish(topic, gc.mem_free())
        await asyncio.sleep(interval)
Last edited by kevinkk525 on Thu Jul 26, 2018 4:17 pm, edited 1 time in total.
Kevin Köck
Micropython Smarthome Firmware (with Home-Assistant integration): https://github.com/kevinkk525/pysmartnode

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Re: Asyncio gets stuck periodically

Post by pythoncoder » Wed Apr 25, 2018 9:12 am

Debugging problems like this is notoriously difficult, as is achieving long term reliability with MQTT over a wireless link.

There is nothing in my code which would produce a time delay of anything approaching an hour so I'd suspect some kind of external event such as a very long burst of radio interference. Even so, that should not lock the scheduler: my code uses nonblocking sockets. The only time it blocks is when doing a DNS lookup as nobody has written a nonblocking version for MicroPython. However this is called once only at startup with the address being stored for later restarts. So if my code blocks while running it has an undiscovered bug.

It is all to easy to write code which inadvertently locks the scheduler. It might be worth making an absolute minimum test case and re-testing to check you haven't inadvertently done this. Incidentally my driver does periodically do a gc.collect() so there should be no need to duplicate this.

Aside from this I'm at rather a loss as to how to proceed. I spent a considerable amount of time testing this under every adverse condition I could create and never experienced a lockup. Without a means of replicating this where I live I'm rather stumped.
Peter Hinch
Index to my micropython libraries.

kevinkk525
Posts: 969
Joined: Sat Feb 03, 2018 7:02 pm

Re: Asyncio gets stuck periodically

Post by kevinkk525 » Wed Apr 25, 2018 9:20 am

Thanks for your answer. I am sure it's not your mqtt library, because you put a lot of effort to test it.

My WIFI is quite stable and the problem does not occur on multiple µC at the same time. It's always one µC.

It's good to know that you never encountered this kind of blocking. I will try to produce a small code to reproduce it but this can take a lot of time as the problem sometimes occurs only after one week.

Another idea I had is to measure the execution of every coroutine (inside the while loop until sleep is called) to see which one actually blocks but that is difficult to retrofit and will take some time.
Kevin Köck
Micropython Smarthome Firmware (with Home-Assistant integration): https://github.com/kevinkk525/pysmartnode

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Re: Asyncio gets stuck periodically

Post by pythoncoder » Wed Apr 25, 2018 11:03 am

Thinking some more about this, what makes you sure that asyncio is stuck? The symptom as you describe it is that communication stops for 1hr 10 minutes. This can happen if there is RF interference or if you take the device out of range of the AP (I appreciate you're not actually doing this). The driver will reconnect once this condition clears. This is normal behaviour but the scheduler should run throughout. Your description of the symptom seems consistent with that behaviour rather than necessarily describing a scheduler lock-up. Or am I missing something?

If asyncio is locking up and you can catch it in the act, issuing ctrl-c will tell you exactly where the fault lies. I appreciate this is not easy with such a rarely occurring fault.
Peter Hinch
Index to my micropython libraries.

kevinkk525
Posts: 969
Joined: Sat Feb 03, 2018 7:02 pm

Re: Asyncio gets stuck periodically

Post by kevinkk525 » Wed Apr 25, 2018 11:48 am

It's not the communication that stops during the time of 1h 10 minutes, it's the scheduler (or something blocking the scheduler).

I run this coroutine:

Code: Select all

async def _stability_log(interval):
    st = time.ticks_ms()
    while True:
        await asyncio.sleep(interval)
        st_new = time.ticks_ms()
        diff = time.ticks_diff(st_new, st) / 1000
        st = st_new
        log.debug("Still online, diff to last log: {!s}".format(diff))
        if diff > 600 * 1.1 or diff < 600 * 0.9:
            log.warn("Diff to last log not within 10%: {!s}, expected {!s}".format(diff, interval))
        gc.collect()
It sends a log message every 10 minutes and warns if the next iteration did not occur within 10minutes +/- 10%.
The logs are being send in another asyncio task and log.debug is therefore non-blocking in this code. Logs are send over mqtt.

Therefore if the scheduler does not block but only communications is stuck, this coroutine should never encounter any "lost time" between iterations and once the communication is back up, it should send a lot of log messages at once (or run out of memory..).

But the actual behaviour is different. It indeed loses the 1h 10 minutes and reports that as a warning log entry. Therefore I have to assume, that the scheduler is either blocking or being blocked by something else. Being blocked by something else is going to be hard to find although there are only 3 coroutines running + mqtt coroutines of your library (which I highly doubt are the problem).

Issuing ctrl-c will be quite difficult but I can try.
Kevin Köck
Micropython Smarthome Firmware (with Home-Assistant integration): https://github.com/kevinkk525/pysmartnode

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Re: Asyncio gets stuck periodically

Post by pythoncoder » Thu Apr 26, 2018 6:17 am

OK, that's a good test. It also rules out the ESP8266 crashing. It rather suggests a busy-wait loop which fails to yield while waiting on some external condition. When the external condition clears, the loop exits and the scheduler restarts.

I'll review my code (probably at the weekend) to see if I can spot such a bug. A bug in my code is more probable than one in the scheduler...

By way of background is your broker on your LAN or the internet? Is there any pattern to when these outages occur?
Peter Hinch
Index to my micropython libraries.

kevinkk525
Posts: 969
Joined: Sat Feb 03, 2018 7:02 pm

Re: Asyncio gets stuck periodically

Post by kevinkk525 » Thu Apr 26, 2018 7:28 am

Thanks for your input.

I decorated all coroutines and some functions to send a log message if they take longer than a predefined time (made the while loops a different coroutine), even your mqtt library. This way I'll hopefully find the coroutine responsible for the freeze.
This should also uncover problems in your software so don't put time in it until I get a result.

My broker is on LAN and there does not seem to be any pattern to when these outages occur.

I expect to get some new results within the next week.
Kevin Köck
Micropython Smarthome Firmware (with Home-Assistant integration): https://github.com/kevinkk525/pysmartnode

c_y
Posts: 1
Joined: Wed Jun 28, 2017 4:26 pm

Re: Asyncio gets stuck periodically

Post by c_y » Tue May 01, 2018 9:39 pm

I wrote a 'truly async' asyncio mqtt library a few years ago, it's been working flawlessly for about a year on a few different devices, things like Sonoffs and other custom hardware. My wifi is really terrible, so obviously if the client roams onto a bad AP the connectivity suffers temporarily, but it always come back. I haven't built QOS=1 support yet, so a publish could get lost, but as soon as the wifi comes back up (which the ESP/Micropython do automatically) then it reconnects and data starts flowing again.

Mine is 248 lines, so it does compile fine on the esp8266. Let me know if you guys would like to try it.

The philosophy of mine was: 1) Let the platform handle the wifi, because it seems to work fine. I don't ever want to call the wifi functions again because those incur flash writes, so it's dangerous to mess with that automatically; 2) Separate the byte-by-byte message accumulation and sending from the processing of a whole accumulated message. This client is intended to really cope with TCP's byte-oriented stream nature, both inbound and outbound. 3) I put the broker's IP addresses, because DNS can block.

kevinkk525
Posts: 969
Joined: Sat Feb 03, 2018 7:02 pm

Re: Asyncio gets stuck periodically

Post by kevinkk525 » Thu May 10, 2018 7:48 am

All my decorations of running code did not give any result. 2 microcontroller went offline after ~1 week for 1h 10 minutes but none of my decorated functions were the reason. I will now change uasyncio core to measure the time a coroutine is executed. I probably won't be able to get the name or line of code where that happens but at least I'll know if it's a part of my code or something else.
Kevin Köck
Micropython Smarthome Firmware (with Home-Assistant integration): https://github.com/kevinkk525/pysmartnode

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Re: Asyncio gets stuck periodically

Post by pythoncoder » Thu May 10, 2018 8:19 am

Almost certainly coincidence but 0x1000 seconds == 1hr 8.27 minutes. I can find no similar properties for 1hr 10mins in ms or μs.

Good luck. These things can be pigs to track down.
Peter Hinch
Index to my micropython libraries.

Post Reply