Ticket UUID: | 080f846fd58958efeaa5e2247622dc0ee7fa1ff7 | |||
Title: | Channel system generating writable events BEFORE channel is open (refchan) | |||
Type: | Bug | Version: | 9.0 | |
Submitter: | apnadkarni | Created on: | 2024-03-28 11:44:21 | |
Subsystem: | 25. Channel System | Assigned To: | apnadkarni | |
Priority: | 5 Medium | Severity: | Important | |
Status: | Closed | Last Modified: | 2024-05-31 11:03:20 | |
Resolution: | Fixed | Closed By: | jan.nijtmans | |
Closed on: | 2024-05-31 11:03:20 | |||
Description: |
A script may open a non-blocking channel and register an "writable" event handler to be called when the channel is open. This worked for refchans as well in 8.6 but in 9.0, the handler is invoked BEFORE the channel is actually open. The culprit is the following code in tclIO.c
The above was added post-8.6 to The impacted case described below involves refchans because the refchan infrastructure uses timers to generate events but I suspect it could also affect ordinary channels that register timers to detect (for example) connection time outs. I ran into the above porting twapi to 9.0. Twapi's TLS sockets are implemented as refchans wrapping Tcl sockets. The issue I'm seeing is the following test case:
A simple test case is not easy to create since it involves timing (the connection should not complete before the timer) and implementation of a virtual channel but the above code seems obviously wrong to me in that it does not check whether channel is actually open. | |||
User Comments: |
jan.nijtmans added on 2024-05-31 11:03:20:
> The issue continuous generation of events of continuous generation of events was already fixed before the rollback Partly, since [70a407c2453e5aa3] was a workaround, not a fix (see Ashok's motivation in this ticket) Anyway this ticket can be closed as "Fixed" now. pooryorick added on 2024-05-31 10:42:55: Most of the examples of refchan implementation on the wiki, like [ https://wiki.tcl-lang.org/page/reflected+channel+example|this one] fail when [chan postevent] is synchronous. After the rollback we're back to the dark days for refchans. pooryorick added on 2024-05-31 10:39:55: The issue continuous generation of events of continuous generation of events was already fixed before the rollback, so the rollback was not needed to address that issue. Therefore the only issue here is change of a public command without a TIP. We fix bugs without a TIP all the time, and [chan postevent] acting in a synchronous manner is a bug. apnadkarni added on 2024-05-26 04:04:30: Back to this, now that beta 2 is out. We are pretty much at an impasse. I'm petitioning for rollback of related refchan changes. Since these impact multiple tickets, I've posted my rationale in the core wiki TL;DR (a) Changing of chan event from synchronous to asynchronous is both unnecessary and a change to a public command without a TIP, (b) continuous generation of events, irrespective of the queue, completely defeats the purpose of event-driven i/o. See link above for details. pooryorick added on 2024-04-19 08:34:54: To summarize again: 1) The "problem" described in this ticket is that a channel is being set to nonblocking before it is connected. The only thing that should be done with a channel before it is actually connected is to wait for it to connect, so the real answer to this ticket is, "don't do that." No example on the wiki has done that in the last 20 years. 2) Regardless of that argument, the code has now been changed so that setting a channel to non-blocking before it is connected now behaves as "expected". 3) After a channel event handler runs, it is necessary to somehow return again to the event loop before another handler runs so that the event loop can mediate the handling of events. That's why the timer was introduced back in the fix for [67a5eabbd3]. Without that, reentrancy bugs happen. The author of Atom recently wrote up a description of exactly the same problem in Ownership and data flow in GPUI: This simplicity, however, led to a subtle bug that went unnoticed until the code was widely used in production. The problem manifested when one listener function emitted an event to the same emitter it was subscribed to. This inadvertently triggered reentrancy, where the emitting function was called again before it had completed its execution. This recursive-like behavior contradicted our expectation of linear function execution and got us into an unexpected state. Even though JavaScript's garbage collector enforces memory safety, the language's relaxed ownership model made it easy for me to write this bug.Rust's constraints make this naive approach to rather more difficult. We're strongly encouraged down a different path, which prevents the kind of reetrancy I described above. In GPUI, when you call emit or notify, no listeners are invoked. Instead, we push data to a queue of effects. At the end of each update we flush these effects, popping from the front of the queue until it becomes empty and then returning control to the event loop. Any effect handler can itself push more effects, but the system eventually quiesces. This gives us run-to-completion semantics without reentrancy bugs and plays nicely with Rust. pooryorick added on 2024-04-19 08:18:35: But the problem described in this ticket has also been fixed, so how is the real issue still open? And what is the "real issue" exactly? jan.nijtmans added on 2024-04-18 15:04:40: I found these related tickets: [18f4a94d03] and [67a5eabbd3d19591] The first was resolved by @nathan as Closed/Invalid, the second one 'fixed' by @nathan (which caused the first bug-report), but never backported to 8.6. It looks like the change in [67a5eabbd3d19591] is the root cause of all the discussions currently happening. There were some small post-commits (like [70a407c2453e5aa3]), but the real issue is still open (as shown with this ticket). I'm currently investigating this further, I'll come back on it. oehhar added on 2024-04-12 08:45:58: The small change was backported to core-8-branch by Jan by commit: [70a407c2453e5aa3]. Ashok mentioned in the chat, that:
Thank you all, Harald pooryorick added on 2024-04-04 11:57:27: In [70a407c2453e5aa3] I've made a small change that may at least keep the timer from being scheduled before the initial connection is complete: When updating the interest of the channel, if a buffer doesn't exist, or if the buffer hasn't been written to, or if the buffer is overflowing, then there is no reason to schedule the timer. In the case that a connection is being established asynchronously, but is not yet connected, the buffer won't exist, so the timer won't be scheduled. This fix may be sufficient to avoid setting the handler timer in all cases where a signal from the underlying resource can't be expected. apnadkarni added on 2024-03-30 15:23:01: Jan, I do not think you have fully understood the issue. It has little to do with write returning 0. It's about indications that network connections (or channels in general) have succeeded when they have not. Or with channels indicating they can accept data when they cannot. And I'll repeat for emphasis, this issue exists outside of twapi; it only came into the picture because I was investigating a failure with 9.0 that is not seen with 8.6.
I do not see any issues with the socket test case. It specifically tests failures of connections to a non-existent port.
The reason the timer events are generated is intended to be a workaround for some shortcomings in the refchan implementation. However, introducing a bug with clearly faulty logic is no workaround.
I've already followed up on the other ticket at [de232b49f26da1c] so will continue the discussion there.
jan.nijtmans added on 2024-03-30 14:42:53: I see two sides: Any channel implementation should be robust enough, if a "write" is executed on a channel which is not yet fully opened, it should just return 0 as a signal to the caller to try again later (or - if the channel is buffered, put the data in the buffer). Is twapi (and the Tcl socket implementation on windows) doing that correctly? If yes, additional timing events shouldn't matter: nothing is really written to the underlying channel. If no, that's a bug that should be fixed first. That said, I don't know the reason those timer events are generated. Maybe as a workaround for a buggy channel implementation. If we would remove this, it will mean that the buggy channel implementation will - again - start to behave wrong. Do we want that? In addition, is the socket-14.11.0 testcase written robust enough? For now, many more questions than answers. But removing this timer code is not the first thing I would recommend. Better, look first at the channel code behaving bad. pooryorick added on 2024-03-29 22:05:02: See my response over in [de232b49f26da1c]. jan.nijtmans added on 2024-03-29 20:43:12: @nathan, can you respond to this? And - if possible - fix it? apnadkarni added on 2024-03-29 05:42:50: Same bug is also the cause of the socket-14.11.0 test failures on Windows. See comment in [de232b49f2].
|