Calling experienced network programmers
Posted: Sun Oct 16, 2016 9:26 am
I should make it clear that I have no experience of network programming. I've used sockets for inter-process communication on a single machine, but never across a LAN. This seems to be a very different kettle of fish. I'm looking for answers to a rather general question: whether the problems I'm experiencing are inherent to the discipline, specific to WiFi LANs, or specific to the ESP8266.
In programming the Pyboard you can be confident that, subject to well-known limitations (interrupts, heap fragmentation), the system is deterministic. With some know-how you can write deterministic applications. My experience with the ESP8266 suggests the contrary.
Three months ago I started a project aimed at bringing MQTT to MicroPython platforms lacking a network interface. The concept was that you'd load an ESP8266 with a firmware build containing my code, install a module on the host platform, link the host to the ESP with a few wires and have ready access to MQTT. It has a few bells and whistles such as setting the Pyboard RTC from an NTP server. I thought this would be easy, and in a sense it was. I got it almost working quite quickly. But the problems of making it work reliably now have me contemplating abandoning the thing in favour of something I can actually achieve. In essence it will work for hours before some disruption on the WiFi causes it to fail. Note that this WiFi network works fine in normal use with PCs and tablets. And I'm not looking for "industrial strength" reliability, just something I could trust for 24/7 domestic use and something I might put on GitHub and advocate for others. Three months on I have not achieved this.
In my testing it seems that every possible network operation can fail. Network connectivity comes and goes. But even when connectivity is present the following conditions occur, invariably on very rare occasions, which makes testing and debugging a time-consuming task. Take it as read that the following are rare events, typically occurring after hours of testing. Some may have occurred only a handful of times.
Blocking sockets can block forever.
Ping responses from the server can take many seconds to appear and sometimes are lost.
The official ntptime module returns invalid data.
MQTT packets sent with qos == 1 fail to receive a PUBACK.
The ESP8266 falls over (see footnote).

It seems that every network operation needs to be conducted with sockets with a timeout and suitable error trapping. The only way I've achieved remotely reliable operation is for the host platform to detect a timeout or responses of gibberish from the ESP and hardware reset the ESP8266, a draconian solution. And there is the vexed issue of how long timeouts should be, especially if the MQTT server is on the internet rather than on a LAN.
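To make "sockets with a timeout and suitable error trapping" concrete, here is a minimal sketch of a read that can never block forever. The helper name recv_exact and its return convention (None on any failure) are assumptions made for illustration, not code from the project. It is written against CPython's socket module; MicroPython's socket objects also provide settimeout() and report a timeout as an OSError, so the same shape should carry over, but treat that as something to verify on the target.

```python
import socket

def recv_exact(sock, nbytes, timeout_s):
    """Read exactly nbytes from sock, or return None on any failure.

    Hypothetical helper. Note that the timeout applies to each recv()
    call individually, not to the read as a whole.
    """
    sock.settimeout(timeout_s)
    buf = b""
    try:
        while len(buf) < nbytes:
            chunk = sock.recv(nbytes - len(buf))
            if not chunk:
                # An empty read means the peer closed the connection.
                return None
            buf += chunk
    except OSError:
        # Covers timeouts as well as resets and other socket errors.
        return None
    return buf
```

The caller treats None as "this link is suspect" and decides whether to retry, reconnect or give up, which keeps the policy about timeout length in one place rather than scattered through the code.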
After three months I'm still not decided on the fundamental architecture. It has become evident that the host must detect failure of the ESP. The question is whether to put a lot of effort into making the ESP code as reliable as possible (difficult) or simply to let it fail and let the host trap the event.
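The "let it fail and let the host trap the event" option might look something like the sketch below. Everything here is hypothetical: the class name LinkSupervisor and the three callables (read_response, which returns None on timeout; is_valid, which rejects gibberish; hard_reset, which pulses the ESP8266 reset line) stand in for whatever the host actually uses. It illustrates the structure only and is not the project's code.

```python
class LinkSupervisor:
    """Host-side watchdog: reset the ESP after repeated bad responses.

    All three callables are assumptions supplied by the caller.
    """

    def __init__(self, read_response, is_valid, hard_reset, max_failures=1):
        self.read_response = read_response  # returns None on timeout
        self.is_valid = is_valid            # True if the response is sane
        self.hard_reset = hard_reset        # e.g. toggles the reset pin
        self.max_failures = max_failures
        self.failures = 0                   # consecutive failures so far
        self.resets = 0                     # total resets issued

    def poll(self):
        """Return a valid response, or None if this poll failed."""
        resp = self.read_response()
        if resp is None or not self.is_valid(resp):
            self.failures += 1
            if self.failures >= self.max_failures:
                self.hard_reset()
                self.resets += 1
                self.failures = 0
            return None
        self.failures = 0
        return resp
```

Setting max_failures above 1 tolerates an isolated late or lost response before resorting to the draconian reset, at the cost of taking longer to recover from a genuine failure.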
Perhaps the sound I can hear in the background is not rain. Maybe it's the sound of seasoned network programmers saying "welcome to our world!". Perhaps a case of an old fool rushing in where angels fear to tread?
footnote: The ESP8266 doesn't actually crash. The Python code continues to run but data corruption has occurred. While this could be a bug in my code, I have grave doubts.