Ticket UUID: | d481f15516a465b1ce796fae7d54a2be3e384a45 | |||
Title: | Closing a transferred channel throws "FlushChannel: damaged channel list" | |||
Type: | Bug | Version: | trunk (3.0a1) | |
Submitter: | anonymous | Created on: | 2019-12-27 11:20:54 | |
Subsystem: | 80. Thread Package | Assigned To: | nobody | |
Priority: | 5 Medium | Severity: | Severe | |
Status: | Open | Last Modified: | 2020-01-10 12:20:46 | |
Resolution: | None | Closed By: | nobody | |
Closed on: | ||||
Description: |
To prevent DNS lookups from blocking my application, I connect sockets in a separate thread. Once the name resolution part is done, I detach the socket from the helper thread and attach it to the main thread. Certain operations performed later on the socket (like closing it) result in a crash, reporting: FlushChannel: damaged channel list. I first encountered this on Tcl 8.6.10 with Thread 2.8.5, but can reproduce it with Tcl 8.6.0/Thread 2.7.0, as well as Tcl trunk/Thread trunk. | |||
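The scenario in the description could be sketched roughly like this (a minimal, hypothetical reproduction sketch, not code from the ticket; the host, port, and helper-thread structure are illustrative placeholders):

```tcl
package require Thread

# Helper thread opens the connection so a slow DNS lookup
# cannot block the main thread.
set worker [thread::create {
    package require Thread
    proc connect {host port} {
        # Asynchronous connect; the internal connect handler is
        # installed in THIS thread.
        set sock [socket -async $host $port]
        # Detach the socket so another thread can adopt it.
        thread::detach $sock
        return $sock
    }
    thread::wait
}]

# Have the worker connect, then adopt the socket in the main thread.
set sock [thread::send $worker {connect example.com 80}]
thread::attach $sock

# Closing (or otherwise using) the transferred socket later can
# trigger the reported panic: "FlushChannel: damaged channel list".
close $sock
```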
User Comments: |
sbron added on 2020-01-10 12:20:46:
I went ahead and opened f583715154 against the Tcl core.

anonymous added on 2020-01-05 14:52:09:
I did some more debugging and found that the problem is that the fileevent handler runs in the wrong thread. This happens because the code uses an internal file handler for an async socket. Any user-defined file handlers are cached until the internal handler has processed the initial connect; it then installs the cached user handlers. But that happens in the thread that did the connect, not necessarily the thread that manages the socket and where the user handlers were defined. When the event for a user handler fires, that handler therefore also runs in the thread that did the connect.

The Thread extension does a Tcl_ClearChannelHandlers(), but that does not clear the internal handler, and there doesn't seem to be any other way for an extension to clear it. So I agree that this is not something that can be fixed in the Thread package.

I have created a patch (attached) that seems to fix the issue. It works as follows: Tcl_CutChannel() sets the managingThread attribute of the channel to NULL (because after cutting the channel, it is no longer managed by any thread). It then triggers the watch procedure of the channel. Upon detecting that there is no managingThread for the channel, the watch proc cancels the internal handler and clears the TCP_ASYNC_PENDING flag. When Tcl_SpliceChannel() is called, it finds a channel with TCP_ASYNC_CONNECT set but TCP_ASYNC_PENDING cleared. This indicates that the channel is doing an async connect, but no handler is in place to watch for the result, so Tcl_SpliceChannel() reinstalls the internal handler. Because it is now installed in the thread that manages the channel, the callback will happen in the correct thread.

I don't know if setting managingThread to NULL may cause problems elsewhere; if so, the test suite doesn't cover that scenario. It still runs without errors after my patch.

I also don't know if the problem exists on Windows and/or if any changes are needed there to work with my patch. To me this looks like a different issue than [815e246806]. Should I open a new ticket against Tcl and propose the patch there?

sebres added on 2019-12-27 13:52:31:
I don't think this is an issue in the Thread module.
It looks like an already known bug in the Tcl IO subsystem: [815e246806]. In any case, I'm pretty sure this has nothing to do with the Thread package. |