debug challenge (SOLVED)

All ESP8266 boards running MicroPython.
Official boards are the Adafruit Huzzah and Feather boards.
Target audience: MicroPython users with an ESP8266 board.
pthwebdev
Posts: 8
Joined: Tue Jul 02, 2019 1:49 pm

Post by pthwebdev » Tue Jul 02, 2019 2:41 pm

Apologies for the long description, but this is a rather difficult issue and I want to show all the steps I have already taken.

On an ESP8266 (a Wemos D1 in the Arduino form factor) I have a chip connected over SPI, using HSPI. Exchanging information for long periods of time is not a problem.

However, when producing data, the chip generates interrupts. I have set up a handler and I use schedule to do the actual processing. Since the object is registered for callbacks on another object, it stays properly referenced. I make sure that the handler is not re-entered (even though, given the timing, this is extremely unlikely). I use disable_irq and enable_irq around the call that schedules the callback, and there is no other code between disabling and re-enabling the IRQ.
When reading data, all buffers are preallocated. So, there is hardly any change in free memory.
Since I compiled the firmware myself, most code is available as frozen modules, leaving about 25K of RAM (which is almost all of it). During operation, this hardly changes. It is not leaking memory. It is not a stack that runs out of control.

I have set micropython.alloc_emergency_exception_buf(1000), which should be plenty.

The problem is that it does not run reliably. It can crash after 50, 150, 1500 and on rare occasions after 3500 or even 5000 interrupts, but it *will* crash.

To make sure that it is not debug output to the serial port that consumes too much CPU, I only write a counter, and only once every ten interrupts.

Most often, I cannot see what caused the crash. Sometimes it is an illegal instruction, but not always.

The interrupt rate does not matter. From hundreds per second down to once per second, only the time until the crash changes. The interrupt frequency seems to be of little consequence.

Changing the CPU clock to 160MHz does not make a difference. This should be equivalent to lowering the frequency of the interrupts, and I have already lowered that way, way down.

I have the same setup on an ESP32 (ESPDUINO form factor; I happened to receive the boards this morning, so I am only now trying this out; I already had other ESP32 boards, so compiling the firmware myself was already possible). This setup is still running and has already handled well over 750,000 interrupts at quite a high rate (250/second). But then, the ESP32 is a different beast altogether. It just tells me that it is unlikely that my code causes the problem.

I am afraid that it is some sort of race condition or that background handling of the WiFi stack interferes.
Perhaps the processor needs to swap code out and the interrupt interferes. Or having the IRQ disabled interferes with the swap. I am not sure if I can increase the amount the ESP8266 reserves as code cache. Since I have plenty of RAM to spare, this could be a way forward.

I don't think it is the WDT. I have my own firmware (neither MicroPython nor Arduino) for the ESP8266 and I understand how this works. I am not certain when MicroPython feeds the watchdog, but there should be plenty of opportunity.

Maybe this is an issue with schedule. I have no idea what happens if schedule is called when another call is already scheduled, or when the next call is scheduled while a scheduled call is still in progress.

I have no idea how to even begin looking for the cause.
I would like to have this running on the ESP8266, although changing to ESP32 is an option.

Anyone any ideas?

Thanks in advance.

---- UPDATE ----

It seems to have been the schedule function.

With the ESP32 as a comparison (I aborted the test at 3M interrupts), I pushed that setup a little harder and got a crash that, very readably, said that the schedule stack had overflowed.
As a solution, instead of calling schedule on every interrupt, I keep a counter. If the counter is zero, I call schedule. Otherwise, I just increment the counter.

Since it is always the same callback that I am scheduling, this callback now does a while loop and decrements the counter.

The test with the ESP8266 is now over 200K interrupts at 200 interrupts a second, and is still going.

I need to increase the speed it can handle, but at least it does not crash. Phew.

Perhaps a helpful issue to have raised.

Thanks for your attention.

jimmo
Posts: 2754
Joined: Tue Aug 08, 2017 1:57 am
Location: Sydney, Australia

Re: debug challenge (SOLVED)

Post by jimmo » Wed Jul 03, 2019 6:40 am

Glad you solved it, and your solution sounds good.

Some questions/notes though:
- On ESP8266, pin interrupts are dispatched via schedule by default anyway (i.e. pin.irq() defaults to hard=False). Are you setting hard=True?
- You shouldn't need to disable interrupts to call schedule. I'm curious: did you see something that suggested this was necessary, or did you see any benefit from doing this?

pthwebdev
Posts: 8
Joined: Tue Jul 02, 2019 1:49 pm

Re: debug challenge (SOLVED)

Post by pthwebdev » Wed Jul 03, 2019 7:58 am

Thanks for the thoughts/notes.

I guess the main reason is unfamiliarity with MicroPython. I have mainly worked on my own firmware. Deferring handling is just a common practise. I was aware of the schedule function and did not think to try working without it.

I was unaware of the "hard" parameter.

I'll see what happens when I refactor the code.

Again, thanks
