Can anyone decode this ESP8266 core dump? [SOLVED] please read!

All ESP8266 boards running MicroPython.
Official boards are the Adafruit Huzzah and Feather boards.
Target audience: MicroPython users with an ESP8266 board.
User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Can anyone decode this ESP8266 core dump? [SOLVED] please read!

Post by pythoncoder » Tue Apr 21, 2020 4:29 pm

Code: Select all

Fatal exception 0(IllegalInstructionCause):
epc1=0x4027edf4, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000

 ets Jan  8 2013,rst cause:2, boot mode:(3,6)

load 0x40100000, len 31604, room 16 
tail 4
chksum 0x14
load 0x3ffe8000, len 1004, room 4 
tail 8
chksum 0xc0
load 0x3ffe83f0, len 1064, room 0 
tail 8
chksum 0x62
csum 0x62
The code which provokes it is too long to post here (~240LOC), but it runs for some time before crashing. It has a soft pin interrupt and uses I2C. MicroPython produces no traceback. Does this dump tell us anything?
Peter Hinch
Index to my micropython libraries.

User avatar
tve
Posts: 216
Joined: Wed Jan 01, 2020 10:12 pm
Location: Santa Barbara, CA
Contact:

Re: Can anyone decode this ESP8266 core dump?

Post by tve » Tue Apr 21, 2020 5:36 pm

IllegalInstruction means you're executing some garbage. I'd look into the .map file to see where 0x4027edf4 is and whether it provides any clue.
Is there any way to tell the "OS" to produce a stack dump? I seem to remember writing a tool that takes the raw stack dump and tries to construct a backtrace from it by finding words that look like they point to code and then running gdb to get the source line. But this was using the "nonos-sdk".

User avatar
jimmo
Posts: 2754
Joined: Tue Aug 08, 2017 1:57 am
Location: Sydney, Australia
Contact:

Re: Can anyone decode this ESP8266 core dump?

Post by jimmo » Wed Apr 22, 2020 3:05 am

pythoncoder wrote:
Tue Apr 21, 2020 4:29 pm
Does this dump tell us anything?
As TvE says, you'd need the exact .map file (or better, the .elf file) that the firmware bin was built with. If it's release 1.12 I can probably see if Damien has it. Unfortunately if I build locally I won't necessarily get the same layout, but if you've compiled it locally and have the .bin and the .elf/.map then that would be useful.

(Edit, just realise the elf and map files are on the downloads page)

Which version are you running, it seems likely that it's probably esp8266-20200421-v1.12-388-g388d419ba.bin? That PC address points to 8 bytes of static const data at the end of pin_init0. Which would suggest that it's somehow running past the ret.n immediately before it. Maybe some sort of race with a double-interrupt?

If you can let me know exactly the version I can dig into it some more.

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Re: Can anyone decode this ESP8266 core dump?

Post by pythoncoder » Wed Apr 22, 2020 6:11 am

Firmware was built from source. The elf file is here: http://hinch.me.uk/firmware.elf.

I'm communicating with a Pyboard which is using I2C slave mode. This works for varying periods - hundreds of transfers in each direction - before failing, usually with an I2C timeout but sometimes with an ESP8266 crash. There is nothing "clever" about the ESP code - it's pure MicroPython/uasyncio V3 - no Viper, Native or Asm. The IRQ and I2C are soft.

I was hoping the dump might give me a clue as to what's going on.
Peter Hinch
Index to my micropython libraries.

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Solved, but please read!

Post by pythoncoder » Wed Apr 22, 2020 3:07 pm

tl;dr Another manifestation of issue 5714. The code works if I regularly ping the ESP8266.

The long version: I have spent the last few days chasing what seemed to be a hardware fault, but one which only occurred quite rarely. The symptom was I2C timeouts and occasional illegal instruction crashes. I guess I should have got there sooner, having reported the bug in the first place but it was a pig to find. Part of the problem was that it wasn't evident which end of the I2C link was the culprit.

In my view this bug is quite serious. It's a clear regression: I tested this code extensively with a pass threshold of 10K messages. After minor mods to port it to uasyncio V3 it failed in a way which seemed inexplicable.
Peter Hinch
Index to my micropython libraries.

User avatar
jimmo
Posts: 2754
Joined: Tue Aug 08, 2017 1:57 am
Location: Sydney, Australia
Contact:

Re: Can anyone decode this ESP8266 core dump? [SOLVED] please read!

Post by jimmo » Wed Apr 22, 2020 3:18 pm

FWIW, that PC address in your firmware.elf is exactly the address of pin_intr_handler

Code: Select all

...
4027edee:	4029      	s32i.n	a2, a0, 16
4027edf0:	ff84f0        	excw
4027edf3:	3f          	.byte 0x3f

4027edf4 <pin_intr_handler>:
4027edf4:	d0c112        	addi	a1, a1, -48
4027edf7:	a1c9      	s32i.n	a12, a1, 40
4027edf9:	91d9      	s32i.n	a13, a1, 36
4027edfb:	81e9      	s32i.n	a14, a1, 32
4027edfd:	b109      	s32i.n	a0, a1, 44
...

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Interesting similarity - regression?

Post by pythoncoder » Wed Apr 22, 2020 6:51 pm

That is interesting. My original bug report was also for code with a pin interrupt: in that case a hard IRQ, in this one it's soft. However the commonest failure mode in the new code is an I2C timeout: an exception is thrown but Python is still running. Hopefully this info will help you guys to track it down.

I'm almost certain it's a regression. My testing ~2 years ago didn't involve any WiFi access or pinging. I can't be absolutely sure that WiFi had been initialised - but my usual approach is to run my do_connect script after a flash erasure.
Peter Hinch
Index to my micropython libraries.

User avatar
tve
Posts: 216
Joined: Wed Jan 01, 2020 10:12 pm
Location: Santa Barbara, CA
Contact:

Re: Can anyone decode this ESP8266 core dump? [SOLVED] please read!

Post by tve » Wed Apr 22, 2020 10:44 pm

Perhaps you could find out whether it's a regression by trying an old firmware version?

May I ask: at what point is supporting/using the esp8266 (with MP) counter-productive in your opinion? I don't want to sound like an esp8266 hater, I am not, but as far as I can tell using it with MP is not a pleasant experience. The two tough questions I would ask: (1) how many new users have been turned off MP because they started with an esp8266 and just got frustrated? (2) how many features did not get developed because time was spent on the esp8266? And I'd add a personal one: do you really like battling the esp8266 for hours on end in order to save the price of a coffee or beer?

If there were plenty of maintainers around and people dedicated to working on the esp866 port I wouldn't ask these questions. I'm just looking at it from a practical limited resource point of view and sometimes one has to ask tough questions (and sometimes the answer is "because" and that may be ok).

User avatar
pythoncoder
Posts: 5956
Joined: Fri Jul 18, 2014 8:01 am
Location: UK
Contact:

Re: Can anyone decode this ESP8266 core dump? [SOLVED] please read!

Post by pythoncoder » Thu Apr 23, 2020 5:02 am

tve wrote:
Wed Apr 22, 2020 10:44 pm
Perhaps you could find out whether it's a regression by trying an old firmware version?
If this would help the maintainers I could do this but I'd have to revert my code and uasyncio too.
tve wrote:
Wed Apr 22, 2020 10:44 pm
May I ask: at what point is supporting/using the esp8266 (with MP) counter-productive in your opinion? ...
Reasons for:
Using MP on ESP8266 is OK. If you're prepared to build firmware quite substantial applications will run on it. Some hackable commercial products use it (Sonoff kit for example). People still have ESP8266 hardware which it seems a shame to discard.

Reasons against (additional to your points):
Its realtime performance is lousy, but the same goes for the ESP32. I'm still to be convinced about the capability of either for long term running without a WDT.

For serious professional class applications the Pyboard D is the only solution (IMO). It wins hands down with excellent realtime performance and rock-solid reliability. But its price is too high for some.
Peter Hinch
Index to my micropython libraries.

Damien
Site Admin
Posts: 647
Joined: Mon Dec 09, 2013 5:02 pm

Re: Can anyone decode this ESP8266 core dump? [SOLVED] please read!

Post by Damien » Thu Apr 23, 2020 5:38 am

The fact that it crashes with an IllegalInstructionCause right at pin_intr_handler is very good evidence that the crash is because pin_intr_handler is in flash and it cannot read flash because the hard GPIO IRQ has interrupted an ongoing flash operation.

Note that pin_intr_handler is called by pin_intr_handler_iram, which does live in iRAM and so that function is at least being called, adding to the evidence that it's because pin_intr_handler lives in flash.

The thing is, pin_intr_handler has always lived in flash (since v1.8, 4 years ago), so this problem has always been around. It may be that it's only surfacing now because flash has a certain combination of functions in it which means that pin_intr_handler keeps getting evicted from the iRAM cache, whereas before it stayed in the cache and could execute without crashing.

Anyway, it really should be that pin_intr_handler is put in iRAM permanently, because it's called on an IRQ. That means all the functions pin_intr_handler calls should also be put in iRAM. That in turn means hard Python IRQs are no longer possible (otherwise the VM needs to be in iRAM and there's not enough iRAM for that).

I've been trying to reproduce a similar crash as reported here, but cannot. So what I'd suggest is that I make a patch where pin_intr_handler and everything it calls goes in iRAM, then @pythoncoder could test it.

Regarding the I2C timeout: this seems like an independent problem, and I don't have any ideas about it yet.

Post Reply