Tcl Source Code

Ticket UUID: 080f846fd58958efeaa5e2247622dc0ee7fa1ff7
Title: Channel system generating writable events BEFORE channel is open (refchan)
Type: Bug
Version: 9.0
Submitter: apnadkarni
Created on: 2024-03-28 11:44:21
Subsystem: 25. Channel System
Assigned To: nobody
Priority: 5 Medium
Severity: Important
Status: Open
Last Modified: 2024-04-19 08:34:54
Resolution: None
Closed By: nobody
Closed on:
Description:

A script may open a non-blocking channel and register a "writable" event handler to be called when the channel becomes open. This worked for refchans as well in 8.6, but in 9.0 the handler is invoked BEFORE the channel is actually open. The culprit is the following code in tclIO.c:

    if (statePtr->interestMask & TCL_WRITABLE
            && GotFlag(statePtr, CHANNEL_NONBLOCKING)
            && !GotFlag(statePtr, BG_FLUSH_SCHEDULED)) {
        /*
         * Restart the timer in case a channel handler reenters the event
         * loop before UpdateInterest gets called by Tcl_NotifyChannel.
         */
        statePtr->timer = Tcl_CreateTimerHandler(SYNTHETIC_EVENT_TIME,
                ChannelTimerProc, chanPtr);
        Tcl_NotifyChannel((Tcl_Channel) chanPtr, TCL_WRITABLE);

The above was added post-8.6 to ChannelTimerProc in [d0dd6d19a4] as the fix for another ticket. Whatever the motivation, this fragment seems incorrect or incomplete to me: under no circumstances should a fileevent writable callback be invoked unless the channel is actually open (writable), and the code above does not make that check.

The impacted case described below involves refchans, because the refchan infrastructure uses timers to generate events, but I suspect it could also affect ordinary channels that register timers to detect (for example) connection timeouts.

I ran into the above while porting twapi to 9.0. Twapi's TLS sockets are implemented as refchans wrapping Tcl sockets. The issue I'm seeing is shown by the following test case:

    set so [twapi::tls_socket -async www.example.com 443]
    fconfigure $so -blocking 0
    fileevent $so writable [list apply {{chan} {
        # chan is not actually open when this is called, since the
        # connection has not completed
    }} $so]
    vwait ...

tls_socket is implemented as a reflected channel. In 8.6 there was no "automatic" timer-based write event generation, so the callback was invoked only on connection completion. With the above change, the callback is invoked from the timer code without any check that the reflected channel is actually open.

A simple test case is not easy to create, since it involves both timing (the connection must not complete before the timer fires) and the implementation of a virtual channel, but the above code seems obviously wrong to me in that it does not check whether the channel is actually open.
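
For illustration only, here is a skeleton of the refchan shape involved. The name ::nullchan and its behaviour are hypothetical, and this does not reliably reproduce the race (which depends on the synthetic timer firing before a real connection would complete); it merely shows a reflected channel that never signals readiness itself:

    proc ::nullchan {cmd chan args} {
        switch -- $cmd {
            initialize { return {initialize finalize watch write} }
            finalize   { return }
            watch {
                # A correct refchan posts events with [chan postevent]
                # only once the underlying transport is usable; this
                # one never posts any event at all.
                return
            }
            write { return [string length [lindex $args 0]] }
        }
    }
    set c [chan create write ::nullchan]
    chan configure $c -blocking 0
    chan event $c writable {
        puts "writable fired, though the refchan never signalled readiness"
    }
    # enter the event loop, e.g. [vwait forever], to observe the event

If the timer in ChannelTimerProc fires for such a channel, the writable handler runs even though no [chan postevent] ever occurred.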

User Comments:

pooryorick added on 2024-04-19 08:34:54:

To summarize again:

1) The "problem" described in this ticket is that a channel is being set to non-blocking before it is connected. The only thing that should be done with a channel before it is actually connected is to wait for it to connect, so the real answer to this ticket is "don't do that." No example on the wiki has done that in the last 20 years.
2) Regardless of that argument, the code has been changed so that setting a channel to non-blocking before it is connected now behaves as "expected".
3) After a channel event handler runs, it is necessary to return to the event loop before another handler runs, so that the event loop can mediate the handling of events. That's why the timer was introduced back in the fix for [67a5eabbd3]. Without that, reentrancy bugs happen. The author of Atom recently wrote up a description of exactly the same problem in "Ownership and data flow in GPUI":
This simplicity, however, led to a subtle bug that went unnoticed until the code was widely used in production. The problem manifested when one listener function emitted an event to the same emitter it was subscribed to. This inadvertently triggered reentrancy, where the emitting function was called again before it had completed its execution. This recursive-like behavior contradicted our expectation of linear function execution and got us into an unexpected state. Even though JavaScript's garbage collector enforces memory safety, the language's relaxed ownership model made it easy for me to write this bug.
Rust's constraints make this naive approach rather more difficult. We're strongly encouraged down a different path, which prevents the kind of reentrancy I described above. In GPUI, when you call emit or notify, no listeners are invoked. Instead, we push data to a queue of effects. At the end of each update we flush these effects, popping from the front of the queue until it becomes empty and then returning control to the event loop. Any effect handler can itself push more effects, but the system eventually quiesces. This gives us run-to-completion semantics without reentrancy bugs and plays nicely with Rust.
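
The same discipline can be sketched in Tcl itself: instead of invoking listeners directly from emit (which permits reentrancy), queue each callback with [after 0] so it runs only after control returns to the event loop. All names here (::emitter::*) are hypothetical:

    namespace eval ::emitter {
        variable listeners {}
        proc subscribe {cmdprefix} {
            variable listeners
            lappend listeners $cmdprefix
        }
        proc emit {event} {
            variable listeners
            foreach cmd $listeners {
                # Defer: the listener never runs inside another
                # listener's stack frame, so a listener that emits
                # again cannot reenter itself.
                after 0 [list {*}$cmd $event]
            }
        }
    }

A listener that calls ::emitter::emit from inside its own callback merely queues more work; nothing runs until control returns to the event loop.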


pooryorick added on 2024-04-19 08:18:35:

But the problem described in this ticket has also been fixed, so how is the real issue still open? And what is the "real issue" exactly?


jan.nijtmans added on 2024-04-18 15:04:40:

I found these related tickets:

[18f4a94d03] and [67a5eabbd3d19591]

The first was resolved by @nathan as Closed/Invalid; the second was 'fixed' by @nathan (which caused the first bug report), but never backported to 8.6.

It looks like the change in [67a5eabbd3d19591] is the root cause of all the discussions currently happening. There were some small follow-up commits (like [70a407c2453e5aa3]), but the real issue is still open (as shown by this ticket).

I'm currently investigating this further; I'll come back on it.


oehhar added on 2024-04-12 08:45:58:

The small change was backported to core-8-branch by Jan in commit [70a407c2453e5aa3]. Ashok mentioned in the chat that:

Thank you all, Harald


pooryorick added on 2024-04-04 11:57:27:

In [70a407c2453e5aa3] I've made a small change that may at least keep the timer from being scheduled before the initial connection is complete: when updating the interest of the channel, if the output buffer doesn't exist, or hasn't been written to, or is overflowing, then there is no reason to schedule the timer. In the case where a connection is being established asynchronously but is not yet connected, the buffer won't exist, so the timer won't be scheduled.

This fix may be sufficient to avoid setting the handler timer in all cases where a signal from the underlying resource can't be expected.
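
For comparison, this is the plain-socket pattern the fix is meant to match, where the writable event has long meant "the connection attempt finished" (a sketch; host and port are placeholders):

    set so [socket -async www.example.com 80]
    chan configure $so -blocking 0
    chan event $so writable {
        # -error is empty on success, otherwise it holds the failure reason
        set err [chan configure $so -error]
        puts [expr {$err eq "" ? "connected" : "connect failed: $err"}]
        chan event $so writable {}
        set ::done 1
    }
    vwait ::done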


apnadkarni added on 2024-03-30 15:23:01:

Jan, I do not think you have fully understood the issue. It has little to do with write returning 0. It's about indications that network connections (or channels in general) have succeeded when they have not, or that channels can accept data when they cannot. And I'll repeat for emphasis: this issue exists outside of twapi; twapi only came into the picture because I was investigating a failure with 9.0 that is not seen with 8.6.

I do not see any issues with the socket test case. It specifically tests failures of connections to a non-existent port.

The timer events are generated as an intended workaround for some shortcomings in the refchan implementation. However, introducing a bug through clearly faulty logic is no workaround.

I've already followed up on the other ticket at [de232b49f26da1c] so will continue the discussion there.

jan.nijtmans added on 2024-03-30 14:42:53:

I see two sides. Any channel implementation should be robust enough that, if a "write" is executed on a channel which is not yet fully opened, it just returns 0 as a signal to the caller to try again later (or, if the channel is buffered, puts the data in the buffer). Are twapi (and the Tcl socket implementation on Windows) doing that correctly? If yes, additional timer events shouldn't matter: nothing is really written to the underlying channel. If no, that's a bug that should be fixed first.
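
At the refchan level, the buffering option mentioned above can be sketched like this (all names, ::tlsish::* included, are hypothetical): a write method that accepts data before the wrapped connection is ready by holding it in an internal buffer, to be flushed once connected:

    namespace eval ::tlsish {
        variable pending ""
        variable connected 0
        proc handler {cmd chan args} {
            variable pending
            variable connected
            switch -- $cmd {
                initialize { return {initialize finalize watch write} }
                finalize   { return }
                watch      { return }
                write {
                    set data [lindex $args 0]
                    if {!$connected} {
                        # Not yet open: accept the bytes into our own
                        # buffer instead of failing.
                        append pending $data
                    } else {
                        # ... flush $pending, then write $data to the
                        # real socket ...
                    }
                    return [string length $data]
                }
            }
        }
    }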

That said, I don't know the reason those timer events are generated. Maybe as a workaround for a buggy channel implementation. If we were to remove them, the buggy channel implementation would - again - start to behave wrongly. Do we want that?

In addition, is the socket-14.11.0 test case written robustly enough?

For now, many more questions than answers. But removing this timer code is not the first thing I would recommend. Better to look first at the channel code that is behaving badly.


pooryorick added on 2024-03-29 22:05:02:

See my response over in [de232b49f26da1c].


jan.nijtmans added on 2024-03-29 20:43:12:

@nathan, can you respond to this? And - if possible - fix it?


apnadkarni added on 2024-03-29 05:42:50:

The same bug is also the cause of the socket-14.11.0 test failures on Windows. See the comment in [de232b49f2].