Calling experienced network programmers

Posted: Sun Oct 16, 2016 9:26 am
by pythoncoder
I should make it clear that I have no experience of network programming. I've used sockets for inter-process communication on a common machine, but never across a LAN. This seems to be a very different kettle of fish. I'm looking for answers to a rather general question: whether the problems I'm experiencing are inherent in the discipline, specific to WiFi LANs, or specific to the ESP8266.

In programming the Pyboard you can be confident that subject to well known limitations (interrupts, heap fragmentation) the system is deterministic. With some know-how you can write deterministic applications. My experience with the ESP8266 suggests the contrary.

Three months ago I started a project aimed at bringing MQTT to MicroPython platforms lacking a network interface. The concept was that you'd load an ESP8266 with a firmware build containing my code, install a module on the host platform, link the host to the ESP with a few wires and have ready access to MQTT. It has a few bells and whistles such as setting the Pyboard RTC from an NTP server. I thought this would be easy, and in a sense it was. I got it almost working quite quickly. But the problems of making it work reliably now have me contemplating abandoning the thing in favour of something I can actually achieve. In essence it will work for hours before some disruption on the WiFi causes it to fail. Note that this WiFi network works fine in normal use with PCs and tablets. And I'm not looking for "industrial strength" reliability, just something I could trust for 24/7 domestic use and something I might put on GitHub and advocate for others. Three months on I have not achieved this.

In my testing it seems that every possible network operation can fail. Network connectivity comes and goes. But even when connectivity is present the following conditions occur, invariably on very rare occasions, which makes testing and debugging a time-consuming task. Take it as read that the following are rare events, typically occurring after hours of testing. Some may have occurred only a handful of times.

1. Blocking sockets can block forever.
2. Ping responses from the server can take many seconds to appear and sometimes are lost.
3. The official ntptime module returns invalid data.
4. MQTT packets sent with qos == 1 fail to receive a PUBACK.
5. The ESP8266 falls over (see footnote).

It seems that every network operation needs to be conducted with sockets with a timeout and suitable error trapping. The only way I've achieved remotely reliable operation is for the host platform to detect a timeout or responses of gibberish from the ESP and hardware reset the ESP8266, a draconian solution. And there is the vexed issue of how long timeouts should be, especially if the MQTT server is on the internet rather than on a LAN.
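
To illustrate the "timeout plus error trapping" idea, here is a minimal sketch of a socket read that cannot block forever. It is written against the standard CPython socket module. The helper name read_with_timeout and the 2 second default are assumptions made for illustration, not code from the project, and a suitable timeout value still has to be found by experiment.

```python
import socket

def read_with_timeout(sock, nbytes, timeout_s=2.0):
    """Read up to nbytes from sock; return None instead of blocking forever.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    sock.settimeout(timeout_s)      # every blocking call now has a deadline
    try:
        data = sock.recv(nbytes)
    except OSError:                 # covers timeouts as well as connection errors
        return None
    return data if data else None   # b"" means the peer closed the connection
```

The caller treats None as "this operation failed" and decides whether to retry, reconnect or report the fault to the host.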

After three months I'm still not decided on the fundamental architecture. It has become evident that the host must detect failure of the ESP. The question is whether to put effort into making the ESP code as reliable as possible (difficult) or simply to let it fail and let the host trap the event.

Perhaps the sound I can hear in the background is not rain. Maybe it's the sound of seasoned network programmers saying "welcome to our world!". Perhaps a case of an old fool rushing in where angels fear to tread?

footnote: The ESP8266 doesn't actually crash. The Python code continues to run but data corruption has occurred. While this could be a bug in my code, I have grave doubts.

Re: Calling experienced network programmers

Posted: Sun Oct 16, 2016 10:55 am
by chrisgp
The problems you describe are true for network programming in general but depending on the circumstances they might not be as common. With Wi-Fi you'll have a greater chance of disruption in the network compared to something like a wired connection where you might not practically notice the problems. Additionally, while the networking stacks do their best to hide network instability from your program (by retransmitting when using TCP for example), the more resource constrained ESP8266 might not be able to handle adverse situations quite as well.

I work on some server software where it has definitely been important to test for subtle networking issues in order to prevent failure even though it is a wired/reliable network. Beyond testing/tweaking the application and operating system, we run redundant servers and the application is coded to switch to a different server if it can't talk to one temporarily.

With your project I think I would continue with a reasonable attempt at ensuring resiliency on the ESP8266 itself -- ensuring blocking calls all have timeouts, and that logically the application handles calls returning failures -- but in the end I think the PyBoard monitoring the health of the ESP8266 and rebooting it is very reasonable.
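
A minimal sketch of the host-side health monitor, assuming the host calls feed() whenever a valid response arrives and poll() at regular intervals. The class name EspWatchdog, the reset_fn callback (standing in for whatever pulses the ESP8266 reset pin) and the 10 second default are hypothetical. It uses time.monotonic from CPython; on a Pyboard the clock would more likely be built on time.ticks_ms and time.ticks_diff.

```python
import time

class EspWatchdog:
    """Host-side monitor: reset the ESP8266 if it stays silent too long.

    Hypothetical sketch: reset_fn stands in for the hardware reset routine.
    """
    def __init__(self, reset_fn, timeout_s=10.0, clock=time.monotonic):
        self._reset_fn = reset_fn
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_ok = clock()
        self.resets = 0

    def feed(self):
        """Call whenever a valid response arrives from the ESP8266."""
        self._last_ok = self._clock()

    def poll(self):
        """Call periodically; return True if a reset was triggered."""
        if self._clock() - self._last_ok < self._timeout_s:
            return False
        self._reset_fn()
        self.resets += 1
        self._last_ok = self._clock()   # give the ESP a fresh grace period
        return True
```

Passing the clock in as a parameter keeps the logic testable without real delays.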

If you haven't already, you might want to do testing where you run your application and create various network scenarios, for example doing things like powering down your access point so the ESP8266 temporarily disconnects from the Wi-Fi network, leaving the access point powered on but unplugging the WAN connection so the ESP8266 maintains a Wi-Fi connection but can't route traffic, and partially degrading the Wi-Fi signal quality by covering the device with something like aluminum foil. Those are completely valid tests and you should expect your application to continue running perfectly with the exception that functionality will be degraded until you power the access point back on, etc.

Re: Calling experienced network programmers

Posted: Sun Oct 16, 2016 12:12 pm
by slzatz
I realize this is extremely elementary but the stability of my esp8266 wifi connection to run an MQTT client seems very much related to how frequently and how long I yield back to the RTOS (with time.sleep(..) or whatever). It would seem like an interesting experiment to vary both and see how that affects stability although it would presumably vary based on the performance of the underlying wifi network. You've probably already played with this but if you haven't, I would systematically vary both the periodicity of the yields and their length and see how that affects performance.
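
A rough sketch of such an experiment, assuming the application's work can be wrapped in a single function. The name run_with_yields and its parameters are hypothetical. It uses time.sleep with seconds, and whether explicit yields are actually needed on the ESP8266 is exactly what the experiment would have to establish.

```python
import time

def run_with_yields(work_fn, iterations, yield_every=1, yield_s=0.01,
                    sleep=time.sleep):
    """Call work_fn repeatedly, sleeping yield_s after every yield_every calls.

    Returns the number of yields made, so runs can be compared.
    """
    yields = 0
    for i in range(1, iterations + 1):
        work_fn()
        if i % yield_every == 0:
            sleep(yield_s)      # hand control back to the underlying system
            yields += 1
    return yields
```

Varying yield_every changes the periodicity of the yields and yield_s changes their length.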

Re: Calling experienced network programmers

Posted: Sun Oct 16, 2016 1:05 pm
by pythoncoder
@chrisgp Thanks for the pointers. I've been doing all the tests you suggest except the WAN test: all of my testing to date has been with a server on my LAN. Oddly, downing the WiFi completely is one of the easier scenarios to handle. The tough ones are the very rare failures when the WiFi appears to be up and running. I'm beginning to feel that this stuff is best left to the experts...

@slzatz Thanks for that, an interesting observation especially as I'm doing no explicit yields to the RTOS - I'm using a scheduler I originally developed for the Pyboard. Elementary it may be, but the need for that had passed me by. I'll experiment.

Re: Calling experienced network programmers

Posted: Sun Oct 16, 2016 3:21 pm
by deshipu
As I understand, the MicroPython interpreter yields control to the RTOS automatically, without you having to do anything.

Re: Calling experienced network programmers

Posted: Sun Oct 16, 2016 4:30 pm
by slzatz
deshipu wrote:As I understand, the MicroPython interpreter yields control to the RTOS automatically, without you having to do anything.
Happy to retest but previously, on the esp8266, if I had a while loop with no time.sleep_XX, I would lose my Wifi connection and it couldn't be re-established.

Re: Calling experienced network programmers

Posted: Sun Oct 16, 2016 6:06 pm
by pythoncoder
slzatz wrote:...Happy to retest but previously, on the esp8266, if I had a while loop with no time.sleep_XX, I would lose my Wifi connection and it couldn't be re-established.
Of the extraordinary range of errors I've encountered I haven't seen that one. I've always been able to re-establish the connection with the following caveat.

I found that sta_if.isconnected() could return True for a while after the network had gone down, so I judge its connection status by checking this four times at 100ms intervals. If it ever returns False the network is down. If it returns True every time, it doesn't need reconnecting. I reconnect by calling the MQTTClient object's connect method.
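
The check described above might look roughly like this. The function name wifi_is_up and its parameters are hypothetical. On the ESP8266 the is_connected argument would be sta_if.isconnected, and the interval is given in seconds because the sketch uses time.sleep.

```python
import time

def wifi_is_up(is_connected, checks=4, interval_s=0.1, sleep=time.sleep):
    """Return True only if is_connected() is True on every one of `checks` polls."""
    for i in range(checks):
        if not is_connected():
            return False        # a single False means the network is down
        if i < checks - 1:
            sleep(interval_s)   # wait 100 ms between polls
    return True
```

For example, wifi_is_up(sta_if.isconnected) takes about 300 ms when the link is healthy and returns early as soon as one poll reports False.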

This is done by a thread which aims to preserve the WiFi connection. Once per second it checks if any other thread has reported a problem. If so it performs the above check and reconnection procedure. Once it deems the connection OK it clears down the problem flag. Inevitably this can take multiple passes and can take an arbitrarily long time, for example if the device has moved out of range or the AP has died.
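
One pass of that housekeeping thread can be sketched as follows. The class name LinkSupervisor and its method names are hypothetical. link_ok stands for the repeated isconnected() check and reconnect for the MQTTClient object's connect method, which is assumed here to raise OSError on failure.

```python
class LinkSupervisor:
    """Once-per-second housekeeping for the WiFi/MQTT link (hypothetical sketch)."""

    def __init__(self, link_ok, reconnect):
        self._link_ok = link_ok        # e.g. the repeated isconnected() check
        self._reconnect = reconnect    # e.g. the MQTT client's connect method
        self.problem = False           # set by any thread that sees a failure

    def report_problem(self):
        self.problem = True

    def run_once(self):
        """Run one pass; return True if the link is believed healthy."""
        if not self.problem:
            return True
        if not self._link_ok():
            try:
                self._reconnect()
            except OSError:
                return False           # still down: retry on the next pass
            if not self._link_ok():
                return False
        self.problem = False           # connection deemed OK: clear the flag
        return True
```

The scheduler would call run_once() once per second. A False result simply means another pass is needed, however long that takes.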


Re: Calling experienced network programmers

Posted: Mon Oct 17, 2016 8:29 am
by pythoncoder
After much consideration I've decided to abandon this project. I've written up the technical issues here ... 829#p14829. Thanks all for your assistance.

I've tried the current libraries, and also a version modified to use sockets with a timeout. I've tried a variety of different approaches. But each time the code grows like bindweed as I deal with the various edge cases which can occur. I don't mind hacking a solution if I can see my way to a clean rewrite at the end. In this instance I can't foresee a way to produce a solution which isn't a revolting hack. Nor can I envisage a situation where every possible edge condition had been handled: my understanding of the problem domain is too limited.

Further I think the project was misconceived. My scheduler is designed to offer fast switching between threads, and the communication method I devised for linking to the Pyboard assumed this. Fast switching is simply never going to occur, since realistic timeouts on sockets are too long. You could, perhaps, get away with two seconds on a LAN, but on the internet much longer values are required. So threads can block for long periods. In my application, because of the communication method, this meant the scheduler on the Pyboard also blocking. I could have avoided this by using a UART, but this may bring its own problems with the ESP8266's only bidirectional device. It would also make development hard: my solution meant I could use the UART for debug messages. If I'd achieved a decent solution on the ESP I might have gone this route.

The long term solution is a nonblocking MQTT library using an enhanced (faster) version of uasyncio, which I gather is a work in progress. If such a library emerges, and can cope with all the tricky edge conditions, I may start from scratch one day. The one thing I've learnt from this project is that I lack the network programming experience to contribute to this. I plan to devote my energies to a project hopefully more within my capabilities.
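
To show why a nonblocking design avoids the problem, here is a small sketch using CPython's asyncio as a stand-in for uasyncio. Whether the same calls are available in uasyncio on the ESP8266 is an assumption. All names (slow_network_op, with_timeout, ticker, demo) are hypothetical. A long network timeout is awaited while another task carries on running.

```python
import asyncio

async def slow_network_op(delay_s):
    """Stand-in for a network round trip that takes delay_s seconds."""
    await asyncio.sleep(delay_s)
    return "reply"

async def with_timeout(coro, timeout_s):
    """Await coro, returning None if it takes longer than timeout_s."""
    try:
        return await asyncio.wait_for(coro, timeout_s)
    except asyncio.TimeoutError:
        return None

async def ticker(counter, period_s):
    """Represents the other tasks that must keep running during the wait."""
    while True:
        counter["ticks"] += 1
        await asyncio.sleep(period_s)

async def demo():
    counter = {"ticks": 0}
    tick_task = asyncio.create_task(ticker(counter, 0.01))
    # The network operation times out, but the ticker is never blocked.
    result = await with_timeout(slow_network_op(1.0), 0.1)
    tick_task.cancel()
    return result, counter["ticks"]
```

Running asyncio.run(demo()) returns None for the timed-out operation together with the number of ticks that occurred during the wait.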

Re: Calling experienced network programmers

Posted: Mon Oct 17, 2016 6:48 pm
by mkarliner
Well, it's not much comfort, but I can eliminate the ESP hardware
as an issue. I've had a weather station with an ESP and running
MQTT on Arduino for many months, and it's still going faultlessly.

Re: Calling experienced network programmers

Posted: Tue Oct 18, 2016 11:54 am
by torwag
@pythoncoder,

Could you be a bit more specific about what exactly you are stopping work on? I identified three main projects:
1. your own scheduler
2. socket based mqtt for devices without network interface
3. work on mqtt primarily on the esp8266 platform

As you know I play (with very slow progress) with mqtt on the esp8266 and we also discussed using your scheduler. It might be important to know for further development whether you will still work on (bugfix) the scheduler.
Or we agree to go with uasyncio and try to make the mqtt libs more robust.