Tcl package Thread source code

Ticket UUID: d481f15516a465b1ce796fae7d54a2be3e384a45
Title: Closing a transferred channel throws "FlushChannel: damaged channel list"
Type: Bug
Version: trunk (3.0a1)
Submitter: anonymous
Created on: 2019-12-27 11:20:54
Subsystem: 80. Thread Package
Assigned To: nobody
Priority: 5 Medium
Severity: Severe
Status: Open
Last Modified: 2020-01-10 12:20:46
Resolution: None
Closed By: nobody
Closed on:
Description:

To prevent DNS lookups blocking my application, I connect sockets in a separate thread. Once the name resolution part is done, I detach the socket from the helper thread and attach it to the main thread. Certain operations performed later on the socket (like closing it) result in a crash, reporting: FlushChannel: damaged channel list.

I first encountered this on Tcl 8.6.10 with Thread 2.8.5, but can reproduce it with Tcl 8.6.0/Thread 2.7.0, as well as Tcl trunk/Thread trunk.
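The scenario can be sketched with a minimal script (hypothetical reproduction, not from the ticket; assumes the Thread package is available, and the host/port and the helper proc name `connectAndDetach` are placeholders):

```tcl
package require Thread

# Helper thread performs the potentially blocking DNS lookup + connect.
set helper [thread::create {
    proc connectAndDetach {host port} {
        set sock [socket -async $host $port]
        # Hand the channel off: after this, no thread manages it.
        thread::detach $sock
        return $sock
    }
    thread::wait
}]

set sock [thread::send $helper [list connectAndDetach example.com 80]]

# Main thread adopts the channel and works with it from here on.
thread::attach $sock
fileevent $sock writable [list apply {{s} {
    fileevent $s writable {}
    # With the bug present, operations such as this close can abort
    # with "FlushChannel: damaged channel list".
    close $s
}} $sock]
vwait forever
```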

User Comments: sbron added on 2020-01-10 12:20:46:

I went ahead and opened f583715154 against the Tcl core.


anonymous added on 2020-01-05 14:52:09:

I did some more debugging and found that the problem is that the fileevent handler runs in the wrong thread. This happens because the code uses an internal file handler for an async socket. Any user-defined file handlers are cached until the internal handler has processed the initial connect; it then installs the cached user handlers. But that happens in the thread that performed the connect, not necessarily in the thread that manages the socket and where the user handlers were defined. So when the event for a user handler fires, that handler also runs in the thread that performed the connect.

The Thread extension does a Tcl_ClearChannelHandlers(), but that does not clear the internal handler. There doesn't seem to be another way for an extension to clear the internal handler either. So I agree that this is not something that can be fixed in the Thread package.
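The transfer path can be sketched as follows (illustrative fragment, not compilable on its own and not the actual Thread extension source; it only names the public core calls involved):

```c
/* Detaching side: stop script-level event delivery, then unlink the
 * channel from the current thread's channel list. */
Tcl_ClearChannelHandlers(chan);  /* clears fileevent handlers only;     */
Tcl_CutChannel(chan);            /* the driver's internal connect
                                  * handler survives both calls */

/* Attaching side, executed in the target thread: */
Tcl_SpliceChannel(chan);         /* channel is now managed here, but the
                                  * stale internal handler still fires
                                  * in the detaching thread */
```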

I have created a patch (attached) that seems to fix the issue. The way it works is that Tcl_CutChannel() sets the managingThread attribute of the channel to NULL (because after cutting the channel, it is no longer managed by any thread). It then triggers the watch procedure of the channel. Upon detecting that there is no managingThread for the channel, the watch proc cancels the internal handler and clears the TCP_ASYNC_PENDING flag. When Tcl_SpliceChannel() is called, it finds a channel with TCP_ASYNC_CONNECT set but TCP_ASYNC_PENDING cleared. This indicates that the channel is doing an async connect, but no handler is in place to watch for the result. So Tcl_SpliceChannel() reinstalls the internal handler. Because it is now installed in the thread that manages the channel, the callback will happen in the correct thread.
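The patch's control flow, as described above, can be summarized in pseudocode (simplified; function and flag names follow the description, not the actual source):

```
Tcl_CutChannel(chan):
    unlink chan from the current thread's channel list
    chan.managingThread = NULL          # a cut channel has no manager
    call chan.watchProc(chan)           # let the driver react to the cut

TcpWatchProc(chan):                     # socket driver's watch proc
    if chan.managingThread == NULL:
        delete the internal connect file handler
        clear TCP_ASYNC_PENDING         # nobody is watching the connect

Tcl_SpliceChannel(chan):
    link chan into the current thread's channel list
    chan.managingThread = current thread
    if TCP_ASYNC_CONNECT set and TCP_ASYNC_PENDING clear:
        # async connect in progress but unwatched: reinstall the
        # internal handler, now in the managing thread
        create internal connect file handler
        set TCP_ASYNC_PENDING
```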

I don't know whether setting managingThread to NULL may cause problems elsewhere; if it does, the test suite doesn't cover that scenario, since it still runs without errors after my patch. I also don't know whether the problem exists on Windows, or whether any changes are needed there for my patch to work.

To me this looks like a different issue than [815e246806]. Should I open a new ticket against Tcl and propose the patch there?


sebres added on 2019-12-27 13:52:31:

I don't think this is an issue in the thread module. It looks like an already known bug in the Tcl I/O subsystem: [815e246806].
This was fixed in branch fix-815e246806-8-6 some time ago, but the fix uses atomic primitives (which may not be available on some platforms), so it could be rewritten to use mutex locks instead.

Anyway, I'm pretty sure this has nothing to do with the thread package.


Attachments: