_thread dies when losing network connection and socket throws OSError - GIL deadlock?

General discussions and questions about development of code with MicroPython that is not hardware specific.
Target audience: MicroPython Users.
pumelo
Posts: 18
Joined: Thu Sep 21, 2017 8:04 am

_thread dies when losing network connection and socket throws OSError - GIL deadlock?

Post by pumelo » Fri May 29, 2020 9:37 pm

Well, I'm facing a strange error and don't know where to dig for a solution (ESP32 port on an ESP32 WROVER).

I have a custom-written event loop with a worker thread which is used to execute blocking operations such as getaddrinfo, wlan.scan, wlan.connect, ussl.wrap and so on.

The code run in the thread can be found below. threadio.fun_arg_queue is just a regular list from which a function, its arguments and a reference to the task are popped.

Most likely this code does not even matter... :-/

What I see is that when I lose the network connection and the open socket throws an OSError, my worker thread sometimes just dies even though it was idling in the sleep loop ...
I suspect that, due to shutting the socket down abnormally or not clearing a socket from select.poll, the GIL somehow deadlocks such that the thread does not get a chance to run anymore ...
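
To illustrate what I mean by clearing a socket from select.poll, a minimal sketch of the teardown order (poller and sock are just placeholders for however the loop tracks them):

Code: Select all

import uselect

def drop_socket(poller, sock):
    # Remove the socket from the poll object *before* closing it, so the
    # poller is never left holding a reference to a dead socket.
    poller.unregister(sock)
    try:
        sock.close()
    except OSError:
        pass  # connection already torn down by the network stack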

Any thoughts on where I should dig?

No clue if it matters, but the wlan.connect and the ussl.wrap(socket) calls were executed in the thread which dies ...

Code: Select all

def worker_in_thread(threadio):
    # Runs in a separate _thread: executes blocking calls queued by the event loop.
    print('worker thread started ... ')
    while not threadio.shutdown:
        if len(threadio.fun_arg_queue) > 0:
            res = None
            task = None
            # Each queue entry is (function, args tuple or None, task reference).
            fn, args, task = threadio.fun_arg_queue.pop(0)
            print('from thread', fn, args, task)
            try:
                if not args:
                    res = fn()
                else:
                    res = fn(*args)
                threadio.res_dict[task] = res
            except Exception as res:
                # Store the exception as the result so the task can re-raise it.
                threadio.res_dict[task] = res
            # Hand the task back to the event loop.
            utrio.reschedule(task, None)
        else:
            utime.sleep(0.1)
        print('-')
        # Heartbeat flag: the main thread can check this to see the worker is alive.
        threadio.alive = True
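
The real scheduling side lives in utrio/to_thread.py; as a purely hypothetical sketch of how a blocking call ends up in this queue (schedule_blocking and the polling wait are invented for illustration, the real code parks the task and lets utrio.reschedule() wake it):

Code: Select all

import utime

def schedule_blocking(threadio, fn, args, task):
    # Hand a blocking call to the worker thread above; the worker stores the
    # result (or the exception it caught) in threadio.res_dict[task] and then
    # wakes the task via utrio.reschedule().
    threadio.fun_arg_queue.append((fn, args, task))

def wait_for_result_polling(threadio, task):
    # Illustration only: the real event loop suspends the task with
    # wait_task_rescheduled() instead of polling like this.
    while task not in threadio.res_dict:
        utime.sleep(0.05)
    res = threadio.res_dict.pop(task)
    if isinstance(res, Exception):
        raise res
    return res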
 

tve
Posts: 216
Joined: Wed Jan 01, 2020 10:12 pm
Location: Santa Barbara, CA

Re: _thread dies when losing network connection and socket throws OSError - GIL deadlock?

Post by tve » Sat May 30, 2020 5:04 pm

Do you have a small complete example to reproduce the problem? I'd be interested in looking into it.

pumelo
Posts: 18
Joined: Thu Sep 21, 2017 8:04 am

Re: _thread dies when losing network connection and socket throws OSError - GIL deadlock?

Post by pumelo » Sun May 31, 2020 12:59 pm

@tve, thanks for offering your help.

Yesterday I tried to write code to reproduce this issue from scratch, but I failed, so clearly I have not understood where the issue originates. This evening or tomorrow I'll approach it the other way, by defeaturing my app. Once I have the minimum code to repro, I'll post it here.

pumelo
Posts: 18
Joined: Thu Sep 21, 2017 8:04 am

Re: _thread dies when losing network connection and socket throws OSError - GIL deadlock?

Post by pumelo » Sun May 31, 2020 9:40 pm

Okay, I would not call this a minimal repro but this is the best thing I can come up with for now.

See attachment.
- The folder utrio contains a WIP port of the Python trio async/await library. Docs can be found here: https://trio.readthedocs.io/en/stable/
- uamqtt is an async rewrite of the mqtt lib from https://github.com/micropython/micropyt ... mple/umqtt
- run main.py

I could reproduce the error on the newest firmware esp32spiram-idf4-20200531-unstable-v1.12-483-g22806ed5d.bin
Sorry, this is a very noisy application with a lot of output...

In order to reproduce:
- edit the wlan AP credentials in main.py
- let it run.
- Switch the AP on and off at will
- Suddenly the worker thread won't run anymore -> observed by the queue getting longer and longer while no debug output from the worker thread appears anymore (a liveness-check sketch follows this list). Be patient, sometimes this takes minutes ...
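
As a liveness check (not part of the attachment, just a hypothetical sketch built on the threadio.alive flag the worker sets):

Code: Select all

import utime

def watch_worker(threadio, timeout_ms=5000):
    # The worker sets threadio.alive = True on every loop iteration; clear the
    # flag here and complain if it is not set again within the timeout.
    threadio.alive = False
    deadline = utime.ticks_add(utime.ticks_ms(), timeout_ms)
    while not threadio.alive:
        if utime.ticks_diff(deadline, utime.ticks_ms()) < 0:
            print('worker thread appears stalled, queue length:',
                  len(threadio.fun_arg_queue))
            return False
        utime.sleep_ms(100)
    return True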

Usually the worker thread becoming unresponsive coincides with an event from the wlan stack such as STA_DISCONNECTED, which triggers exceptions in the "subtask_connect" task. However, I also see that the "Crash Task" crashing can coincide with the worker thread stalling. The trio nursery catches the exception of the first task which throws, injects a Canceled exception into the other tasks in that nursery, and finally re-raises an exception which propagates "upstream".

I don't have insight into the internals of the exception handling. Might the cause of locking up the thread originate there?
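
For context, the nursery behaviour described above mirrors upstream trio; a minimal CPython example using the real trio API (the utrio port is meant to follow the same semantics, and whether the MicroPython exception machinery behaves identically is exactly the open question):

Code: Select all

import trio

async def fails():
    await trio.sleep(0.1)
    raise OSError("simulated EHOSTUNREACH")

async def sibling():
    try:
        await trio.sleep(60)          # gets cancelled, never finishes
    except trio.Cancelled:
        print("sibling cancelled because another task in the nursery raised")
        raise                          # Cancelled must always be re-raised

async def main():
    try:
        async with trio.open_nursery() as nursery:
            nursery.start_soon(fails)
            nursery.start_soon(sibling)
    except Exception as exc:
        # Depending on the trio version this is the bare OSError or an
        # ExceptionGroup wrapping it.
        print("nursery propagated the original error upstream:", repr(exc))

trio.run(main)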

Example log output below (watch out for "from thread" messages suddenly disappearing).
Length of Thread queue is now 1
from thread length of queue is: 1
from thread <function> (0.1,) <Task Object: sync_sleep, CPU: 0.001423109>
publishing, 44
from thread length of queue is: 0
publishing, 45
sync sleep
schedule <function> in thread from task <Task Object: sync_sleep, CPU: 0.001415517>
Length of Thread queue is now 1
from thread length of queue is: 1
from thread <function> (0.1,) <Task Object: sync_sleep, CPU: 0.001469956>
publishing, 46
from thread length of queue is: 0
publishing, 47
sync sleep
schedule <function> in thread from task <Task Object: sync_sleep, CPU: 0.001462102>
Length of Thread queue is now 1
I (260937) wifi: state: run -> init (3c0)
I (260937) wifi: pm stop, total sleep time: 28803713 us / 78385947 us

I (260937) wifi: new:<6,0>, old:<6,0>, ap:<255,255>, sta:<6,0>, prof:1
I (260947) wifi: STA_DISCONNECTED, reason:3
<Task Object: subtask_connect, CPU: 0.008480228> received exception
Traceback (most recent call last):
File "utrio/__init__.py", line 120, in resume
File "main.py", line 30, in subtask_connect
File "uamqtt.py", line 165, in publish
File "uamqtt.py", line 51, in _send_str
File "utrio/socket.py", line 185, in send_all
OSError: [Errno 113] EHOSTUNREACH
<Nursery object at 3fa9c0d0> get canceled with
Traceback (most recent call last):
File "utrio/__init__.py", line 120, in resume
File "main.py", line 30, in subtask_connect
File "uamqtt.py", line 165, in publish
File "uamqtt.py", line 51, in _send_str
File "utrio/socket.py", line 185, in send_all
OSError: [Errno 113] EHOSTUNREACH
<Task Object: subtask_connect, CPU: 0.008478289> cancel called
<Task Object: sync_sleep, CPU: 0.001515745> cancel called
<Task Object: crash, CPU: 3.876495e-06> cancel called
<Task Object: subtask_connect, CPU: 0.008477763> exit
<Task Object: sync_sleep, CPU: 0.001515639> received exception
Traceback (most recent call last):
File "utrio/__init__.py", line 120, in resume
File "main.py", line 37, in sync_sleep
File "utrio/to_thread.py", line 72, in run_sync
File "utrio/to_thread.py", line 33, in schedule_in_thread
File "utrio/__init__.py", line 185, in wait_task_rescheduled
Canceled:
<Nursery object at 3fa9c0d0> get canceled with
Traceback (most recent call last):
File "utrio/__init__.py", line 120, in resume
File "main.py", line 37, in sync_sleep
File "utrio/to_thread.py", line 72, in run_sync
File "utrio/to_thread.py", line 33, in schedule_in_thread
File "utrio/__init__.py", line 185, in wait_task_rescheduled
Canceled:
<Task Object: sync_sleep, CPU: 0.001515287> cancel called
<Task Object: crash, CPU: 3.875323e-06> cancel called
<Task Object: sync_sleep, CPU: 0.001515222> exit
<Task Object: crash, CPU: 3.875158e-06> received exception
Traceback (most recent call last):
File "utrio/__init__.py", line 120, in resume
File "main.py", line 42, in crash
File "utrio/_time.py", line 55, in sleep
File "utrio/_time.py", line 34, in delay_task
File "utrio/__init__.py", line 185, in wait_task_rescheduled
Canceled:
<Nursery object at 3fa9c0d0> get canceled with
Traceback (most recent call last):
File "utrio/__init__.py", line 120, in resume
File "main.py", line 42, in crash
File "utrio/_time.py", line 55, in sleep
File "utrio/_time.py", line 34, in delay_task
File "utrio/__init__.py", line 185, in wait_task_rescheduled
Canceled:
<Task Object: crash, CPU: 3.874302e-06> cancel called
<Task Object: crash, CPU: 3.874212e-06> exit
Canceled:
restart
try exit but not allowed
schedule <function> in thread from task <Task Object: sync_sleep, CPU: 0.0>
Length of Thread queue is now 2
I (263017) wifi: STA_DISCONNECTED, reason:201
no AP found
I (265087) wifi: STA_DISCONNECTED, reason:201
no AP found
I (267147) wifi: STA_DISCONNECTED, reason:201
no AP found
I (269217) wifi: STA_DISCONNECTED, reason:201
no AP found
I (271287) wifi: STA_DISCONNECTED, reason:201
no AP found
I (273347) wifi: STA_DISCONNECTED, reason:201
no AP found
I (275417) wifi: STA_DISCONNECTED, reason:201
no AP found
Attachments
reproduce_lock_thread.zip

pumelo
Posts: 18
Joined: Thu Sep 21, 2017 8:04 am

Re: _thread dies when losing network connection and socket throws OSError - GIL deadlock?

Post by pumelo » Mon Jun 01, 2020 12:19 pm

BTW: One of the tests I performed was to raise the RTOS task priority of the thread's task by one, so that the thread's priority is higher than the priority of the main Python task. -> Same result ...

pumelo
Posts: 18
Joined: Thu Sep 21, 2017 8:04 am

Re: _thread dies when losing network connection and socket throws OSError - GIL deadlock?

Post by pumelo » Tue Jun 02, 2020 9:25 am

This does not seem to be an issue with the GIL.

I modified the idle function in modmachine.c to exit the GIL (this might be a good thing anyway?) and I now call that function in the loops of both the main thread and the worker thread. Still the worker thread dies ...

STATIC mp_obj_t machine_idle(void) {
    MP_THREAD_GIL_EXIT();
    taskYIELD();
    MP_THREAD_GIL_ENTER();
    return mp_const_none;
}
Shouldn't this be enough for the worker thread to come alive?
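
For reference, a sketch of what the worker's idle branch looks like with that call added (run_one_job is a hypothetical stand-in for the pop-and-execute code from my first post; whether the GIL is actually released depends on the patch above):

Code: Select all

import machine
import utime

def worker_loop_with_idle(threadio):
    # Same shape as worker_in_thread() from the first post, with machine.idle()
    # added in the idle branch.
    while not threadio.shutdown:
        if threadio.fun_arg_queue:
            run_one_job(threadio)  # hypothetical helper: pop + execute as before
        else:
            machine.idle()   # with the patch above: GIL exit -> taskYIELD -> GIL enter
            utime.sleep(0.1)
        threadio.alive = True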

Are there additional mutexes which could prevent the VM in the worker thread from running?

This is the point where I don't know what else to do about it, so I'll just accept that threads on the ESP32 are broken ...

MLCarter
Posts: 1
Joined: Wed Sep 01, 2021 9:47 pm

Re: _thread dies when losing network connection and socket throws OSError - GIL deadlock?

Post by MLCarter » Thu Sep 02, 2021 3:40 am

I have exactly this problem with an ESP32. I have a server running in a separate thread, and sometimes (NOT every time) a sleep in the server thread stops the measurements being taken in the main thread. This happens when the WiFi is down and the code is waiting for it to come back up.
So I'm trying to not use sleep! Any/all comments welcome; note I'm a self-taught hacker from the back blocks.

Code: Select all

import socket
import time

# get_setup(), wifi() and web_page() are defined elsewhere in the application.

def server():
    cmds = get_setup()
    while True:
        listening = False
        wifigood, rssi = wifi(cmds)
        print('Checking WiFi rssi=' + str(rssi))
        if wifigood:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.settimeout(30.0)
            try:
                s.bind(('', 1010))
                s.listen(3)
                listening = True
                print('Listening on 1010')
            except OSError:
                print('Socket bind error wait 30')
                pause(30)
            while listening:
                try:
                    conn, addr = s.accept()
                    print('Connection from ' + str(addr))
                    conn.send('HTTP/1.1 200 OK\n')
                    conn.send('Content-Type: text/html\n')
                    conn.send('Connection: close\n\n')
                    html = web_page()
                    conn.sendall(html)
                    request = conn.recv(1024)
                    conn.close()
                except OSError as e:
                    x = e.args[0]
                    print('SE' + str(e))
                    # 116 = ETIMEDOUT (accept timed out), 104 = ECONNRESET;
                    # any other error drops out of the listen loop.
                    if x != 116 and x != 104:
                        print('Error not 104 or 116')
                        listening = False
            print('Close socket')
            s.close()
        else:
            print('No WiFi wait 5m and Retry')
            pause(300)
    return

def pause(duration):
    # Busy-wait instead of time.sleep(), since sleep is suspected of
    # stalling the main thread.
    t0 = time.time()
    while time.time() - t0 < duration:
        pass
    return
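
Not a fix, but one thing to try: calling machine.idle() inside the busy-wait so other FreeRTOS tasks (WiFi, the other MicroPython thread) get CPU time. Note pumelo's finding above that the stock machine.idle() on the ESP32 port yields to FreeRTOS but does not release the GIL, so this may not help the other Python thread unless idle is patched as shown earlier:

Code: Select all

import time
import machine

def pause(duration):
    # Busy-wait as above, but yield to the FreeRTOS scheduler on every pass.
    # Whether the other *Python* thread actually runs depends on machine.idle()
    # releasing the GIL (see the patched machine_idle() earlier in the thread).
    t0 = time.time()
    while time.time() - t0 < duration:
        machine.idle()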
