[workaround used] esp8266 gets stuck periodically

All ESP8266 boards running MicroPython.
Official boards are the Adafruit Huzzah and Feather boards.
Target audience: MicroPython users with an ESP8266 board.
kevinkk525
Posts: 969
Joined: Sat Feb 03, 2018 7:02 pm

Re: Asyncio gets stuck periodically

Post by kevinkk525 » Tue May 29, 2018 9:46 am

Now it's getting a bit interesting here:
- Interrupt is being executed during outages, which is good so I can build a watchdog
- I get this log from time to time

Code: Select all

Coroutine <generator object '_handle_msg' at 3fff6e50> took 4294968ms
which is interesting, as this is exactly the time that the microcontroller is offline sometimes. But it does not get logged directly after the outage and more often which means that this is not the reason for the outage.
On the other hand that log would actually mean that no other coroutine was executed for 1h 10 minutes as only the execution of a coroutine until the next yield is meassured. But as I log something else every 10 minutes I know that this did not happen and there were coroutines scheduled during that time.

This makes me wonder if my changes to uasyncio to log the execution time of coroutines work correctly. Changes are covered in ##########

Code: Select all

    def run_forever(self):
        cur_task = [0, 0, 0]
        while True:
            # Expire entries in waitq and move them to runq
            tnow = self.time()
            while self.waitq:
                t = self.waitq.peektime()
                delay = time.ticks_diff(t, tnow)
                if delay > 0:
                    break
                self.waitq.pop(cur_task)
                if __debug__ and DEBUG:
                    log.debug("Moving from waitq to runq: %s", cur_task[1])
                self.call_soon(cur_task[1], *cur_task[2])

            # Process runq
            l = len(self.runq)
            if __debug__ and DEBUG:
                log.debug("Entries in runq: %d", l)
            while l:
                cb = self.runq.popleft()
                l -= 1
                args = ()
                if not isinstance(cb, type_gen):
                    args = self.runq.popleft()
                    l -= 1
                    if __debug__ and DEBUG:
                        log.info("Next callback to run: %s", (cb, args))
                    #############################
                    st = time.time()
                    cb(*args)
                    et = time.time()
                    if (et - st) > 10000:
                        global_log.error("Callback {!s} took {!s}ms".format(cb, et - st))
                    #############################
                    continue

                if __debug__ and DEBUG:
                    log.info("Next coroutine to run: %s", (cb, args))
                self.cur_task = cb
                delay = 0
                ##############################
                st = time.ticks_ms()
                #############################
                try:
                    if args is ():
                        ret = next(cb)
                    else:
                        ret = cb.send(*args)
                    if __debug__ and DEBUG:
                        log.info("Coroutine %s yield result: %s", cb, ret)
                    if isinstance(ret, SysCall1):
                        arg = ret.arg
                        if isinstance(ret, SleepMs):
                            delay = arg
                        elif isinstance(ret, IORead):
                            cb.pend_throw(False)
                            self.add_reader(arg, cb)
                            #############################
                            et = time.ticks_ms()
                            if (et - st) > 10000:
                                global_log.error("Coroutine {!s} took {!s}ms".format(cb, et - st))
                            ##############################
                            continue
                        elif isinstance(ret, IOWrite):
                            cb.pend_throw(False)
                            self.add_writer(arg, cb)
                            #############################
                            et = time.ticks_ms()
                            if (et - st) > 10000:
                                global_log.error("Coroutine {!s} took {!s}ms".format(cb, et - st))
                            ##############################
                            continue
                        elif isinstance(ret, IOReadDone):
                            self.remove_reader(arg)
                        elif isinstance(ret, IOWriteDone):
                            self.remove_writer(arg)
                        elif isinstance(ret, StopLoop):
                            return arg
                        else:
                            assert False, "Unknown syscall yielded: %r (of type %r)" % (ret, type(ret))
                    elif isinstance(ret, type_gen):
                        self.call_soon(ret)
                    elif isinstance(ret, int):
                        # Delay
                        delay = ret
                    elif ret is None:
                        # Just reschedule
                        pass
                    elif ret is False:
                        # Don't reschedule
                        #############################
                        et = time.ticks_ms()
                        if (et - st) > 10000:
                            global_log.error("Coroutine {!s} took {!s}ms".format(cb, et - st))
                        ##############################
                        continue
                    else:
                        assert False, "Unsupported coroutine yield value: %r (of type %r)" % (ret, type(ret))
                except StopIteration as e:
                    if __debug__ and DEBUG:
                        log.debug("Coroutine finished: %s", cb)
                    #############################
                    et = time.ticks_ms()
                    if (et - st) > 10000:
                        global_log.error("Coroutine {!s} took {!s}ms".format(cb, et - st))
                    ##############################
                    continue
                except CancelledError as e:
                    if __debug__ and DEBUG:
                        log.debug("Coroutine cancelled: %s", cb)
                    #############################
                    et = time.ticks_ms()
                    if (et - st) > 10000:
                        global_log.error("Coroutine {!s} took {!s}ms".format(cb, et - st))
                    ##############################
                    continue

                #############################
                et = time.ticks_ms()
                if (et - st) > 10000:
                    global_log.error("Coroutine {!s} took {!s}ms".format(cb, et - st))
                ##############################
                # Currently all syscalls don't return anything, so we don't
                # need to feed anything to the next invocation of coroutine.
                # If that changes, need to pass that value below.
                if delay:
                    self.call_later_ms(delay, cb)
                else:
                    self.call_soon(cb)

            # Wait until next waitq task or I/O availability
            delay = 0
            if not self.runq:
                delay = -1
                if self.waitq:
                    tnow = self.time()
                    t = self.waitq.peektime()
                    delay = time.ticks_diff(t, tnow)
                    if delay < 0:
                        delay = 0
            self.wait(delay)
Also I checked that _handle_msg only exits if the connection to the broker breaks down but I don't have any log saying that it reconnected (which happens if the function self.connect is executed as my function is saved as self._connected_handler).
A reconnected log happens from time to time but never close to "Coroutine <generator object '_handle_msg' at 3fff6e50> took 4294968ms" but sometimes close to the log about the outage (which makes sense of course as it was offline).

So I'm actually quite confused about the information I got. Maybe @pythoncoder can help me out with this as it's mostly about your mqtt library and you know asyncio very well?
Kevin Köck
Micropython Smarthome Firmware (with Home-Assistant integration): https://github.com/kevinkk525/pysmartnode

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Re: Asyncio gets stuck periodically

Post by pythoncoder » Tue May 29, 2018 3:05 pm

A few thoughts:

When differencing times you should use utime.ticks_diff(). Subtraction can produce errors owing to modulo 2**30 rollover. While this evidently isn't happening here it could cause confusion.

Your debugging efforts seem to be concentrated on usayncio. My own view is that my library is more likely to have a bug, provoked by some environmental issue on your site. This would explain why you experience this problem when I and others haven't. There are a lot of special cases with radio links and it's very possible that I've missed one. So I'd instrument my code rather than that of uasyncio. Do you have a reason for suspecting uasyncio? When I was developing this module I had no issues with uasyncio.

As you probably are aware the _handle_msg coro does run for very long periods - in fact if the connection is maintained it runs forever. It seems odd if it seems to be running for the duration of the outage. Unless it's periodically being rescheduled for the duration with repeated attempts at reconnection.

You will find that my MQTT_base has a DEBUG class variable. If you set that True it will print some hopefully helpful messages.
Peter Hinch
Index to my micropython libraries.

kevinkk525
Posts: 969
Joined: Sat Feb 03, 2018 7:02 pm

Re: Asyncio gets stuck periodically

Post by kevinkk525 » Tue May 29, 2018 4:24 pm

Thank you for your thoughts.
utime.ticks_diff() is what I should have used. Will retrofit it.

I don't suspect asyncio to be the problem anymore (sorry that the title still says that) but I use uasyncio to debug and measure every coroutine as it is easier to modify loop.run_forever() than every coroutine running. This way I can measure the time that is spent on every coroutine until it yields and I (theoretically) should see which coroutine is responsible for the outage. So I asked you for help with uasyncio to check my changes so that I did not do a dumb mistake that would invalidate my measurements. And also because I use your mqtt library.

It's possible that your library has a bug but it could also be possible that there's a bug in socket or somewhere else networking related. The last socket bug was fixed not long ago but did not change anything in that problem.

And that's where I get confused, I know that _handle_msg coro does run for a long time (until connection breaks down) but according to my measurement changes in uasyncio loop.run_forever I'm not measuring the total execution time of a coroutine but only the execution time of a single blocking call to a part of the coroutine until it yields. Therefore a value that large would actually mean that a single part of the coroutine blocks everything for 1h 10 minutes (except if I made a mistake in the way I put that measurements into the uasyncio code).

The debug messages in your MQTT class are nice but not helpful as I would have to have the microcontroller attached to a PC and save every output to a file for one week until that problem occurs again. That's why I made the debug messages (measurements) in uasyncio to be send over the network. Change the mqtt class to do the same could result in a lot of loops and a lot of traffic but I could try to do that.
Kevin Köck
Micropython Smarthome Firmware (with Home-Assistant integration): https://github.com/kevinkk525/pysmartnode

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Re: Asyncio gets stuck periodically

Post by pythoncoder » Wed May 30, 2018 5:27 am

kevinkk525 wrote:
Tue May 29, 2018 4:24 pm
...Therefore a value that large would actually mean that a single part of the coroutine blocks everything for 1h 10 minutes...
That is precisely the type of error I suspect we're encountering. As I'm sure you're aware, when a cooperative scheduler locks up it's almost invariably because a coro is blocking, either by accessing a blocking resource or by performing an iterative loop which for a period of time fails to yield. These errors are easily made. And, as you suggest, it's possible that there is an error in the socket code which causes a nonblocking socket to block under some exotic circumstance. This had also occurred to me.

In my experience the way to track these down is to instrument the application. It seems you've identified the coro which is blocking but we need to find where this is occurring. Perhaps the dprint() method should be modified to emit a timestamp. Would it be feasible to add an sdcard and log to that?

I agree the method is prone to generating a lot of data and I don't envy you having to locate such a rarely occurring fault. But at the moment I can't think of a better way to track it down :(
Peter Hinch
Index to my micropython libraries.

kevinkk525
Posts: 969
Joined: Sat Feb 03, 2018 7:02 pm

Re: Asyncio gets stuck periodically

Post by kevinkk525 » Wed May 30, 2018 6:12 am

The interesting thing about the value is that this single part of the coroutine is actually not blocking as during that time other code is being executed. That's why I'm not sure about the correctnes of my measurement changes to uasyncio.

I can change the dprint() statement to log to an sdcard. I will try to attach that module I have laying around somewhere and try that. Maybe there will be some more information in there.
Additionally I will measure every socket operation and log that result if it exceeds a few seconds.
Kevin Köck
Micropython Smarthome Firmware (with Home-Assistant integration): https://github.com/kevinkk525/pysmartnode

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Re: Asyncio gets stuck periodically

Post by pythoncoder » Thu May 31, 2018 7:42 am

I think that's a good plan.

Instrumenting uasyncio is fine for learning how uasyncio works but I fear it could be confusing when debugging an application.
Peter Hinch
Index to my micropython libraries.

newtocoding
Posts: 1
Joined: Tue Jun 12, 2018 12:17 am

Re: Asyncio gets stuck periodically

Post by newtocoding » Sat Jun 16, 2018 4:44 am

New to micropython and I'm looking for some assistance getting started with MQTT client on an ESP8266. Not sure where to start. I've been googling for awhile. Basically trying to connect 8266 to a pi running mosquitto that is already setup. I really need some very basic information to get going. This may not be the are to place this info... my apologies.

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Starting out with umqtt

Post by pythoncoder » Sun Jun 17, 2018 9:38 am

Firstly, to get help in a forum it's best to post a new topic. Then your query will be seen by everyone. Hijacking an existing long thread confuses the discussion. Secondly your query will only be seen by people who have an interest in following that particular thread.

To answer your question the best place to look is at these example programs.
Peter Hinch
Index to my micropython libraries.

kevinkk525
Posts: 969
Joined: Sat Feb 03, 2018 7:02 pm

Re: Asyncio gets stuck periodically

Post by kevinkk525 » Thu Jun 21, 2018 6:59 am

It feels like my esp8266 are making fun of me.. They now worked over 2 weeks without an outage but finally I got 2 controllers to have an outage.
On one I get the log message:

Code: Select all

[mqtt_as] wait_msg sock took 4294967
This would have been the outage time but sadly it was reported during a time the controller worked perfectly fine, making me wonder where it comes from..

After the outage I only get these logs on both controllers:

Code: Select all

[machine.debug] Diff to last log not within 10%: 4388.04, expected 600
[machine.debug] Diff to last log not within 10%: 4421.13, expected 600
They indicate that for ~1h 10 minutes there was no code executed.

But as I actually wrapped all socket operations in time recording functions resulting in a log like above, there is now no indication of sockets being the problem of the outages. Which again leaves me completeley clueless as no coroutine took too long either.

My only idea now is that the poller.poll operation has to be stuck somewhere. Or does anyone have another idea?

Guess I just have to use a software watchdog to make this outages as short as possible.
Kevin Köck
Micropython Smarthome Firmware (with Home-Assistant integration): https://github.com/kevinkk525/pysmartnode

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Re: Asyncio gets stuck periodically

Post by pythoncoder » Thu Jun 21, 2018 7:48 am

As you know the asynchronous MQTT code uses nonblocking sockets. As such, it doesn't use the uasyncio I/O mechanism embodied in the StreamReader and StreamWriter classes. So, unless your application uses it, this line should always return an empty tuple after waiting for the required delay. You could instrument this: there might just be some weird counter-rollover type of bug here.

Another thought is that whatever external circumstance is provoking this 70 minute outage is actually locking up the vendor code and preventing the MicroPython VM from running code for the duration.

One way to gather evidence for this proposition might be to make the main uasyncio loop update a counter, and periodically publish the counter value. Clearly if the VM were blocked for 70 minutes this would become apparent. The same symptom would occur if a coroutine blocked, but you seem to have established that this is not occurring.

I appreciate I originally counselled against instrumenting uasyncio on the assumption that it was my code which was at fault, but this does seem to be getting deeply strange.
Peter Hinch
Index to my micropython libraries.

Post Reply