ESP32 reliability

All ESP32 boards running MicroPython.
Target audience: MicroPython users with an ESP32 board.
pythoncoder
Posts: 4678
Joined: Fri Jul 18, 2014 8:01 am
Location: UK

Re: ESP32 reliability

Post by pythoncoder » Thu Aug 06, 2020 9:07 am

The 5V line shows about 200mV peak to peak of noise. The 3.3V line shows -70mV blips of 100μs duration. If I set the scope to trigger at -125mV it never triggers. This seems pretty clean.

Power problems seem unlikely to be the cause of the reboots. The test sends bursts of messages every five seconds while also receiving messages. Reception also involves transmission, partly because of my protocol and partly because of TCP/IP itself, so there is a lot of transmission going on. By contrast, in my first test two reboots occurred on a single night in ten days of running.

I suspect that the reboots may be a rare response of FreeRTOS to WiFi outages, but I have no idea how one might prove or disprove this assertion.
Peter Hinch

Mike Teachman
Posts: 94
Joined: Mon Jun 13, 2016 3:19 pm
Location: Victoria, BC, Canada

Post by Mike Teachman » Sun Aug 09, 2020 4:15 am

Voltage dips sure don't appear to be a factor at all.

Is there any merit in trying to get JTAG debugging working, to identify the code that is running when the crash happens? I've never attempted source level debugging with the ESP32.

Post by pythoncoder » Sun Aug 09, 2020 7:53 am

Doubtless. But well beyond my pay grade I'm afraid ;)

The IDF4 test has now run for its target of ten days with the following outcome:
WiFi outages: 126
Reboots: 6
No missed or duplicate messages.

A longer test would be required to be sure, but it would seem that IDF3 is better in regard to reboots (2 vs 6).
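
To put a rough number on why a longer test would be needed: suppose reboots are independent events arriving at a constant rate (a Poisson assumption that nothing in this thread establishes) and that both runs lasted ten days. If the two IDF versions then had the same underlying rate, the 8 reboots would split between the runs like 8 fair coin flips. The sketch below computes how often a split at least as lopsided as 2 vs 6 would arise by chance. The function name is made up for illustration.

```python
from math import comb

def two_sided_split_p(a, b):
    """Probability of a split at least as uneven as (a, b) when each event
    is equally likely to land in either run (equal rates, equal durations)."""
    n = a + b
    k = max(a, b)
    # One tail: k or more of the n events fall in one particular run.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Two-sided: either run could have been the worse one.
    return min(1.0, 2 * tail)

p = two_sided_split_p(2, 6)
print(round(p, 3))  # 0.289
```

Under those assumptions a 2 vs 6 split turns up by chance roughly 29% of the time, so these counts alone cannot separate the two IDF versions.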
Peter Hinch

ESP32 reliability. Final conclusions?

Post by pythoncoder » Sun Aug 09, 2020 4:59 pm

I started a new test running a less demanding test script. This restricts communication to half-duplex: the sender waits for an ACK from the receiver before sending a new message. I used IDF3 on the theory that it might be less prone to reboots.
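
For readers unfamiliar with the pattern, the half-duplex scheme described above (send, wait for ACK, then send the next message) can be sketched as follows. This is not the actual test script: it is a desktop-Python simulation in which in-memory queues stand in for the WiFi link, and every name and the timeout value are assumptions for illustration.

```python
import asyncio

async def sender(tx, ack_rx, messages, timeout=1.0):
    """Send each message only after the previous one has been ACKed."""
    for seq, payload in enumerate(messages):
        while True:
            await tx.put((seq, payload))
            try:
                ack = await asyncio.wait_for(ack_rx.get(), timeout)
            except asyncio.TimeoutError:
                continue  # No ACK in time: resend the same message.
            if ack == seq:
                break  # Matching ACK: safe to send the next message.

async def receiver(rx, ack_tx, count, received):
    """Deliver each sequence number once and ACK everything seen."""
    expected = 0
    while expected < count:
        seq, payload = await rx.get()
        if seq == expected:
            received.append(payload)
            expected += 1
        # ACK duplicates too, so a sender whose ACK was lost can move on.
        await ack_tx.put(seq)

async def main():
    link, acks = asyncio.Queue(), asyncio.Queue()
    messages = ["a", "b", "c"]
    received = []
    await asyncio.gather(
        sender(link, acks, messages),
        receiver(link, acks, len(messages), received),
    )
    return received

result = asyncio.run(main())
print(result)  # ['a', 'b', 'c']
```

Because only one message is ever in flight, the radio does far less concurrent transmit and receive work than in the full-duplex test, which is what makes this the less demanding case.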

In four hours running it has rebooted twice, so I'm terminating the test. My observations and conclusions are as follows:
  • Applications doing bidirectional communications can expect occasional spontaneous reboots.
  • These may or may not be associated with WiFi outages.
  • The interval between reboots is highly variable, from hours to many days.
  • IDF version is not a factor.
  • I believe my test setup eliminates PSU problems.
  • My test setup uses the reference board without SPIRAM.
  • My test location is prone to WiFi outages. This is deliberate as my aim is to achieve communications which survive these.
  • The Pyboard D has run this, and the more demanding test, faultlessly through multiple outages.
Peter Hinch

kevinkk525
Posts: 782
Joined: Sat Feb 03, 2018 7:02 pm

Post by kevinkk525 » Mon Aug 10, 2020 4:57 am

I suspect that there are still some strange bugs in IDF3 and IDF4 (or in the MicroPython ESP32 port), because my board often suffers a CPU crash that leaves a stack trace, at least when running my own project. I haven't analyzed that one yet, so I just let the ESP crash once a day, which is fine in my use case.
Kevin Köck
Micropython Smarthome Firmware (with Home-Assistant integration): https://github.com/kevinkk525/pysmartnode

Post by pythoncoder » Mon Aug 10, 2020 5:26 am

Interesting. Perhaps it depends on use patterns. I have never seen a crash except on SPIRAM boards.
Peter Hinch

tve
Posts: 214
Joined: Wed Jan 01, 2020 10:12 pm
Location: Santa Barbara, CA

Post by tve » Mon Aug 10, 2020 7:07 am

I just looked, my "reliability test" board (which I more or less forgot about) has been up for >5 weeks and has clocked >3 million MQTTS messages. Care to try my fork of MP? https://github.com/tve/micropython/rele ... v1.12-tve2
I do see the MQTTS connection break periodically and wanted to investigate, but haven't found the energy for that...

Post by pythoncoder » Tue Aug 11, 2020 6:20 am

@tve Test now running with your build.
Peter Hinch

Post by tve » Tue Aug 11, 2020 7:49 am

Cool. My build disables bluetooth and that frees up a lot of esp-idf heap memory. But that doesn't seem to be the issue with the TLS disconnects I'm seeing (just disconnects, not resets). I'm trying to enable more debug options to see whether I can get esp-idf to print something...

cgtan2020
Posts: 7
Joined: Thu May 28, 2020 7:53 am

Post by cgtan2020 » Tue Aug 11, 2020 8:19 am

tve wrote:
Mon Aug 10, 2020 7:07 am
I just looked, my "reliability test" board (which I more or less forgot about) has been up for >5 weeks and has clocked >3 million MQTTS messages. Care to try my fork of MP? https://github.com/tve/micropython/rele ... v1.12-tve2
I do see the MQTTS connection break periodically and wanted to investigate, but haven't found the energy for that...
I am trying your firmware, but it gives this error after my UDP loop has run for about 10 minutes.

>>>>>>>>>

Guru Meditation Error: Core 1 panic'ed (Unhandled debug exception)
Debug exception reason: BREAK instr
Core 1 register dump:
PC : 0x400803c0 PS : 0x00060436 A0 : 0x00000000 A1 : 0x3ffd0270
A2 : 0x3ffc1da4 A3 : 0x3ffbd790 A4 : 0x00000004 A5 : 0x3ffe68bc
A6 : 0x00000017 A7 : 0x00000000 A8 : 0x00060023 A9 : 0x00000000
A10 : 0x00000000 A11 : 0x3ffc2914 A12 : 0x800992b8 A13 : 0x3ffda870
A14 : 0x00000003 A15 : 0x00060023 SAR : 0x0000001a EXCCAUSE: 0x00000001
EXCVADDR: 0xffffffd0 LBEG : 0x4000c2e0 LEND : 0x4000c2f6 LCOUNT : 0xffffffff

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x400803bd:0x3ffd0270

Rebooting...
ets Jun 8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:5132
load:0x40078000,len:12944
load:0x40080400,len:3472
entry 0x40080638
I (566) cpu_start: Pro cpu up.
I (566) cpu_start: Application information:
I (566) cpu_start: Compile time: May 25 2020 23:42:57
I (570) cpu_start: ELF file SHA256: 0000000000000000...
I (576) cpu_start: ESP-IDF: v4.0
I (580) cpu_start: Starting app cpu, entry point is 0x40082b6c
I (573) cpu_start: App cpu up.
I (591) heap_init: Initializing. RAM available for dynamic allocation:
I (598) heap_init: At 3FFAFF10 len 000000F0 (0 KiB): DRAM
I (604) heap_init: At 3FFB6388 len 00001C78 (7 KiB): DRAM
I (610) heap_init: At 3FFB9A20 len 00004108 (16 KiB): DRAM
I (616) heap_init: At 3FFBDB5C len 00000004 (0 KiB): DRAM
I (622) heap_init: At 3FFCC978 len 00013688 (77 KiB): DRAM
I (628) heap_init: At 3FFE0440 len 00003AE0 (14 KiB): D/IRAM
I (635) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
I (641) heap_init: At 4009F1F8 len 00000E08 (3 KiB): IRAM
I (647) cpu_start: Pro cpu start user code
I (666) spi_flash: detected chip: generic
I (666) spi_flash: flash io: dio
I (667) cpu_start: Starting scheduler on PRO CPU.
I (0) cpu_start: Starting scheduler on APP CPU.
I (524) modsocket: Initializing
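
A dump like the one above can be reduced to the few fields needed for symbol lookup. The following is a minimal sketch, assuming the log has been captured as text; the function name and regular expressions are illustrative and are not part of MicroPython or ESP-IDF.

```python
import re

PANIC_RE = re.compile(r"Guru Meditation Error: Core\s+(\d+) panic'ed \((.+?)\)")
FRAME_RE = re.compile(r"(0x[0-9a-fA-F]{8}):(0x[0-9a-fA-F]{8})")

def parse_panic(log_text):
    """Return (core, reason, [pc, ...]) from a panic dump, or None."""
    m = PANIC_RE.search(log_text)
    if m is None:
        return None
    pcs = []
    for line in log_text.splitlines():
        if line.startswith("Backtrace:"):
            # Each frame is printed as PC:SP; keep only the program counters.
            pcs = [pc for pc, _sp in FRAME_RE.findall(line)]
            break
    return int(m.group(1)), m.group(2), pcs

sample = """Guru Meditation Error: Core 1 panic'ed (Unhandled debug exception)
Debug exception reason: BREAK instr
Backtrace: 0x400803bd:0x3ffd0270
"""
core, reason, pcs = parse_panic(sample)
print(core, reason, pcs)  # 1 Unhandled debug exception ['0x400803bd']
```

The extracted addresses can then be passed to `xtensa-esp32-elf-addr2line -pfiaC -e <firmware.elf> <addresses>`, provided the ELF file from exactly the same build is available. Here the backtrace holds a single frame and the ELF SHA256 is reported as all zeros, so the log by itself neither identifies the build nor promises much detail.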
