Tcl Source Code

View Ticket
Login
Ticket UUID: de232b49f26da1c18e07513d4c7caa203cd27910
Title: write-only nonblocking refchan and Tcl internal buffers
Type: Bug Version:
Submitter: pooryorick Created on: 2019-04-26 20:53:55
Subsystem: 04. Async Events Assigned To: pooryorick
Priority: 5 Medium Severity: Important
Status: Open Last Modified: 2024-04-09 21:24:43
Resolution: None Closed By: nobody
    Closed on: 2019-05-01 11:00:15
Description:

Given the following write-only refchan, the coroutine attempting to use such a channel yields once and writes to the channel once. After the coroutine yields a second time, the channel never becomes writable again and the coroutine is never entered again:

namespace eval refchan {
    namespace ensemble create
    namespace export *


    proc finalize {chan args} {
        namespace delete c_$chan
    }


    proc initialize {chan args} {
        namespace eval c_$chan {}
        namespace upvar c_$chan watching watching
        set watching {}
        list finalize initialize seek watch write
    }


    proc watch {chan args} {
        namespace upvar c_$chan watching watching
        foreach arg $args {
            switch $arg {
                write {
                    if {$arg ni $watching} {
                        lappend watching $arg
                    }
                    chan postevent $chan $arg
                }
            }
        }
    }


    proc write {chan args} {
        chan postevent $chan write
        return 1
    }
}


after 0  [list coroutine c1 apply [list {} {
    set chan [chan create write refchan]
    chan configure $chan -blocking 0
    chan event $chan writable [list [info coroutine]]
    for {set i 0} {$i < 10} {incr i} {
        yield
        puts $chan hello
    }
    set ::done 1
}]]
vwait done

The problem is that tclIO.c/Write adds the value from the first write into an internal buffer but takes no actions that result in another call to Tcl_NotifyChannel. In particular, the "write" command of the reference channel is never called because the result of IsBufferFull is "false".

User Comments: pooryorick added on 2024-04-09 21:24:43:

Ashok wrote, "A change to preallocate channel buffers, for example, would break this 'guess'."

No, it wouldn't, because the check is that the buffer is not empty (i.e. there has already been a write performed), and not full (i.e. and the write has not been flushed).


pooryorick added on 2024-04-09 20:15:58:

Ashok, is io-14.11.0 still failing on your system with [70a407c2453e5aa3] (which is the fix for [080f846fd58958ef])?


apnadkarni added on 2024-04-08 15:10:50:

Nathan,

Starting with your last comment,

I think one thing we can all agree on is that if a device signals that its ready for a write...

To expand on that, I think we can all agree none of the current proposals are ideal. My concern with a broader, "better" solution has always been one of timing w.r.t 9.0 release. I have reiterated my concerns with the current implementation often enough. Alternative (1) below seems less risky in comparison to a more complete solution because of 8.6 history (notwithstanding any the other refchan bugs, restricting discussion to the issue in this ticket). Similarly, alternative (2) below seems less risky because the channel subsystem must in any case be presumed to work correctly with -buffering none. If it does not, that is something to be fixed irrespective of refchans or anything else. If someone has the time to work through the details, and sufficient review, a more complete solution is certainly something desirable. I just do not want to hold up 9.0 for this ideal scenario. If you take a stab at this (I'm still prioritizing getting my extensions and distributions 9.0-ready), it would certainly be a worthwhile exercise.

As to the other points,

Nathan wrote If you can eliminate that timer without breaking anything, than please do! What I'm arguing against is just reverting this change without providing any other fix, and without even providing an example of something that is broken and can't be fixed by any of the existing knobs.

I already suggested two solutions

  1. For the refchan to generate the event (off a timer). Why is this better than the way it is done currently within the channel core? Because (a) the refchan has better information about connection state, and (b) it only affects refchans where this is needed, not ALL channels (refchans or not).

  2. Implicitly setting -buffering none on refchans.

I already stated neither is ideal, I prefer (1), but both are better than what we currently have.

I am waiting for you to specifically state what breaks with either of the above or at least some technical reasons in opposition. The two objections I have heard are:

  • That would be a knee-jerk reaction to a particular problem rather than a well-considered design.

  • Paraphrasing my understanding of you stated, Scripts should not be burdened with the responsibility.

With regards to the first, it's difficult for me to translate an opinion to a broken behaviour or technical objection I can rationally discuss.

With regard to the second, if the refchan is implemented in scripts, only the script knows channel state, and must own the responsibility of generating events. Even with refchan enhancements, this would still be true; the core could generate events but the information about state would still have to be provided by the refchan through an enhanced API.

Moreover, your comments on how twapi or 14.11 or port scanners user code should be written actually burdens scripts with changing valid code to workaround the bugs introduced by an attempt to reduce their burden! The irony!

Nathan wrote The problem is that Tcl writes to a buffer without actually calling the [write] procedure that a refchan provides.

Yes, and -buffering none on refchans, implicitly or explicitly, forces that write, thereby resolving the issue for refchans!

Nathan wrote: No one has made test io-14.11.0 reliably fail in the error case, so it's currently not a meaningful test.

I've already commented on this "meaningful" criterion for tests in b7f4b5a3f4. If we have different opinions of what a meaningful test is, so be it. I will give another example of where socket connection open checks are important before doing any I/O on the channel (as in some of your examples that do puts). Applications, for example, may want to protect against DNS spoofing with reverse DNS lookups. This must be done before *any io on the channel, including output. When the connection open notification is received, the peer address must be available. This will not work with premature connection opens. (I understand you proposed buffer checks may fix this premature notification problem but want to insist 14.11 not be dismissed as meaningless in its intent or its implementation.)

Now coming to those additional proposed buffer checks,

Nathan wrote The timer can be constrained to cases where a write (perhaps only to a buffer) has already occurred and an unfilled buffer exists.

I saw that in commit 70a407c245. (I missed the comment in some other ticket; one consequence of discussion being spread across three tickets). I'm afraid this only illustrates the problem with the current design. It forces the channel core to attempt to deduce information it does not have. Here it is attempting to figure out channel state based on buffer state. Not only is this just a "guess", you have now made event generation a dependency on the buffer management code. A change to preallocate channel buffers, for example, would break this "guess". In SE terms, high coupling between components that are conceptually independent. This is not good design for maintainability.


pooryorick added on 2024-04-03 19:47:55:

You mean this fix that has been in place since 2019? Not unless this failure has been occurring for the last 4 years.


jan.nijtmans added on 2024-04-03 09:21:14:

I'm seeing the http-3.26.1 testcase fail sometimes on MacOS. Could this 'fix' be the cause of that?

See: https://github.com/tcltk/tcl/actions/runs/8534740843/job/23379729000


pooryorick added on 2024-04-02 12:36:43:

Adjusting the timer like that is one way to achieve rate control, but Ashok is arguing that rate control should be achieved by waiting for the device to signal that it's ready for more writes. The problem is that Tcl writes to a buffer without actually calling the [write] procedure that a refchan provides. The [write] procedure is the logical (and only) place for a refchan to signal that its ready for another write more, but because of this "write to a buffer instead of performing a write to the device" phenomenon, refchan must continuously signal that it's ready for a write, which means setting up a repeatedly-scheduled event.

I think one thing we can all agree on is that if a device signals that its ready for a write, and Tcl writes to a buffer instead of actually performing a write, then Tcl can assume that the device is still ready for a write, and can safely generate a "writable" event again. We should be able to develop an improved approach based on this assumption, and the new approach will eliminate the need for refchans to set up such repeatedly-scheduled events, and probably also eliminate the need for that CreateChannelScheduledEvent() function that sebres just pointed to.


sebres added on 2024-04-02 11:03:00:

Just few cents from my side about SYNTHETIC_TIMER - in my event-perf-branch from RFE [fdfbd5e10], I rewrote this synthetic timers with a scheduled event which can be retarded inside the handler if it is not yet ready.


pooryorick added on 2024-04-02 10:26:35:

TLDR: If you can find an acceptable way to eliminate that timer, then I'm all for it. Just don't revert this change wholesale. The situation calls for more effort than that.

Next...

I understand all of the principles that you've explained, and in theory agree. The thing is that Tcl provides a variety of knobs to achieve various behaviours, and also provides refchans, which must function even in the absence of external state changes that typical channels have. You're arguing from a matter or principle, but the trick is to accommodate all of these behaviours and contexts in practice. If you can eliminate that timer without breaking anything, than please do! What I'm arguing against is just reverting this change without providing any other fix, and without even providing an example of something that is broken and can't be fixed by any of the existing knobs. I have shown how to asynchronously connected by using [socket -async] and not configuring the channel as nonblocking until the "connected" event is handled. No one has made test io-14.11.0 reliably fail in the error case, so it's currently not a meaningful test. The idle queue starvation you mentioned was easily fixable separately. Until there is a concrete example showing something that needs fixing, the best way to continue is for any interested party to make improvements to existing code, and not to revert the change that fixed this issue.

I'm not fond of that timer either, and I've even provided an idea for taking steps towards eliminating it. Perhaps that got lost in the noise: The timer can be constrained to cases where a write (perhaps only to a buffer) has already occurred and an unfilled buffer exists. This would move things in the direction you're advocating by only employing the timer when it is really necessary. If you have time, you could do that work, or I'll do it when I have time, or maybe someone will come up with a way to eliminate the timer entirely. Let's work on it, and not just revert changes that have been in place for four years already, and that do solve issues like the one described in the current report.

All a refchan should have to do is use [chan postevent] to indicate that it is writable. Yes, refchans currently set up timers keep those writable events posting, but as you've just elegantly explained, that's a bad hack that violates the principles of asynchronous channels. So let's not enshrine that in the design, but instead get to work on the internals so that things remain simple at the script level, and we can continue to improve the situation in the implementation as intelligence and time resources become available.


apnadkarni added on 2024-04-02 07:35:31:

Nathan,

I'll make one last attempt at persuasion that

  • from an architectural point of view, generating I/O callback events without knowing channel state is not at all an appropriate model for async event driven I/O.

  • The related changes in trunk vis-a-vis 8.6 are broken in multiple aspects, mostly because of the above.

The motivation for async i/o is that applications can do useful work while waiting for I/O to complete. Now this by itself does not need select on Unix, completion ports on Windows or fileevent in Tcl. One can just call the i/o functions and check for "EAGAIN" equivalents. There are multiple problems with this as the application has to poll:

  • Try too frequently and its wasting processing cycles processing EAGAINS.

  • Try too infrequently and there is an unnecessary delay / latency in servicing I/O requests.

An event system solves both the above as long as events are generated based on the I/O state. No time is wasted in unnecessary polls and there is not additional delay once the channel is ready for I/O. However, if I/O events are generated based on timers with no knowledge of I/O state it has exactly the same effect as application polling! It is completely pointless - generating events when I/O state does not reflect channel readiness, and unnecessary delays after readiness before the timer expires. If you truly believe this as a solution, you should be amenable to completely getting rid of Tcl's I/O related event subsystem! The application could just generate write events using chan postevent on a regular basis with after. This is basically what the current Tcl 9 implementation does, queueing events on a timer basis.

The above is the motivation for async/event-driven operation from an efficiency and performance perspective. However, there is also the "simple" matter of correctness. An event should not be generated prematurely reflecting a state that does not actually exist.

In other words, from both the performance and correctness point of view, the current timer based write event generation in ChannelTimerProc which is not based on channel state is fundamentally flawed. It does not fulfil the intended purpose of an event based i/o system by essentially polling, and moreover does not meet correctness criteria as it has no idea of channel state and generates events prematurely (14.11 failure).

All the above is a comment on event driven i/o and channels in general. Now as far as refchans are concerned, there is a limitation in the refchan framework as mentioned in an earlier post which led to the original defect you logged in this ticket. To reiterate, there is no script level equivalent of the Tcl_EventSetupProc and Tcl_EventCheckProc. Thus some, not all, refchans are forced to use a timer base scheme to generate these events. However, this must be done by the refchan script implementation itself and not the generic channel infrastructure because the former knows the channel state, the latter does not. This is still not ideal from the efficiency perspective, but being able to check state, it is at least correct.

I believe that is how Andreas' virtchannel modules in tcllib work. And as proof of concept, following the tcllib model, I've modified your refchan implementation of 44.6 and attached as refchan-async-redux.tcl (proof of concept only modeling tcllib). This works in Tcl 8.6 as well (which your version did not). TL;DR the changes you made to the core in ChannelTimerProc (a) lead to at least two bugs 11.14 and event q starvation, (b) affected channels other than refchans, (c) were unnecessary.

Given (imho) the enhancement of the refchan framework as too much of a risk for a 9.0 release, I see two possibilities that are acceptable (not perfect, just acceptable) for 9.0 release:

  • revert the implementation to what 8.6 does. No need for -buffering none in this case but the script level refchan implementation has to generate timer events, check state and then do a after idle after 0 chan postevent from the timer callback. See tcllib or attached sample. Alternatively, the refchan can do the -buffering none itself and avoid the timer if that suits its purpose.

  • If you do not want the channel script implementation to have that responsibility, (I would like to know why not) then set -buffering none for refchans as proposed in my branch.

I prefer (1).

I am pretty much going to stay silent on this topic now. I cannot provide any more clarity on my objections. Finally, some group of people has to decide on a course of action. Hopefully, that group is not just you and me.


apnadkarni added on 2024-04-02 03:16:12:

Nathan,

I've created a new ticket, [ac7592e73c10d04b], for the idle queue starvation that Ashok provided an example of, and also fixed the issue by having tclIO.c/ChannelTimerProc() us the idle queue instead of the standard queue to reschedule itself.

The following concerns come to mind with this change:

  • Changing to the idle queue would seem to have similar effects to changing the SYNTHETIC_TIMER from 0 to a positive number. The comments for that #define specifically warn against that.

  • Posting to the idle queue when other file events are posted to the event queue seems like it could lead to the reverse situation where your posted events are starved.

  • Given that event processing in Tcl_DoOneEvent and its ilk allows filtering based on flags TCL_FILE_EVENTS|TCL_TIMER_EVENTS|TCL_IDLE_EVENTS this further confuses what those flags mean. Will your events fire in a TCL_FILE_EVENTS loop?

However, I am not going to spend time chasing the above when I believe the fundamental design itself is flawed, irrespective of whether you use the timer queue or the idle queue.


apnadkarni added on 2024-04-02 02:49:54:

Nathan,

The reason I backed out your test case modifications were differing semantics. You added an additional vwait command in effect creating a delay that allows the connection success/failure to happen in time! I do think the original test could have been better targeted because the puts is irrelevant to the actual bug. I may add a separate test case in my branch to illustrate the actual connection failure but don't want to muddy the waters here.

And regarding your comment -

If you want to prove your point, and also show that you're not just holding my changes hostage to win this argument, you'll make if fail consistently on all platforms so that it has meaning as a test.

Some test cases, by their nature, cannot be made to fail reliably on every platform every time. This can be either because the failure is timing dependent, depends on the libraries in use, operating system versions etc. Examples include failures caused by race conditions in multithreaded cases - it may take millions of iterations to cause a failure, may depend on processor speed, number of processors etc. So also, 14.11 is timing dependent. There is effectively a race between the connection completion and delivery of the write event.

FWIW, on my system 14.11 fails every single time. I do not know if that reproducibility is specific to Windows or to my system. Don indicated it is sporadic on his (non-Windows) system which is an indication it is timing related.

So yes, while ideally failures would be consistently reproducible, across all platforms all the time, that is not always possible. Does not make the test invalid.

And in cases like this, where the cause of the failure is ascertained to be the code being tested, all the more incentive to keep the test.

Think of test failures analogous to the user reporting a crash. I suppose your response would be "Oh unless you can show that happens on all systems, it is not a bug"!


pooryorick added on 2024-04-01 21:16:02:

Given that [socket -async] returns a blocking channel, it's more likely that the designers intended it to remain blocking until the connection is confirmed to be established.


pooryorick added on 2024-04-01 21:13:15:

The examples on the wiki were written over the course of the last 20 years, so I don't know which "bug" you're talking about, given that the changes Ashok and I are debating were introduced more recently.


jan.nijtmans added on 2024-04-01 20:47:32:

> Almost all of the examples of [socket -async] on the wiki set the channel to nonblocking only after the connection is established. Seems like historically, people understood the right way to use [socket -async].

It's more likely that many people found out that setting [socket -async] from the start doesn't work right, because the Tcl implementation is buggy. Why setting up the channel cannot be asynchronious? Let's fix that!


pooryorick added on 2024-04-01 20:33:30:

If 14.11.0 was intentionally written to test this particular thing, then why is it only sporadically failing now, and only on some platforms? Also, why do the other tests that use the same pattern not fail then? Written as they are, only to a some-value-of-"works" standard, these tests are at least a little bit suspect. Almost all of the examples of [socket -async] on the wiki set the channel to nonblocking only after the connection is established. Seems like historically, people understood the right way to use [socket -async].

Regarding the fix for [ac7592e73c10d04b], care to explain how it "hides the real problem"? I think the fix gets a red herring out of the way so we don't waste time discussing it. The timer itself is debatable, but we might as well debate a correct implementation of it.


jan.nijtmans added on 2024-04-01 18:19:25:
> Ashok, since you undid my change to socket test 14.11.0 ...

Ashok's action was IMHO fully justified. @nathan, please don't change a testcase which is written as-such on purpose. Ashok is not holding your changes hostage to win the argument: the goal here is to come up with a _real_ solution.

About the change from timer event to idle event: It's - indeed - actually the same, but it also hides the real problem. Not a good change to make directly on trunk, while discussion is ongoing. It's also not bad enough to start a 'commit war'.

pooryorick added on 2024-04-01 00:29:51:

I've created a new ticket, [ac7592e73c10d04b], for the idle queue starvation that Ashok provided an example of, and also fixed the issue by having tclIO.c/ChannelTimerProc() us the idle queue instead of the standard queue to reschedule itself.


pooryorick added on 2024-03-31 17:39:07:

Ashok, since you undid my change to socket test 14.11.0, it is now back to failing sporadically. If you want to prove your point, and also show that you're not just holding my changes hostage to win this argument, you'll make if fail consistently on all platforms so that it has meaning as a test.


pooryorick added on 2024-03-31 17:03:58:

I don't see anything wrong with delaying the writable event notification until a connection is established, but that should be done in such a way that this issue remains fixed. I have an idea for that: The issue here is that "writable" event notice happens once, then a "writable" event happens once but then never again because Tcl writes to a buffer instead of calling the "write" procedure for the refChan. I think it's pretty clear that refchan implementations shouldn't be burdened with the duty of maintaining timers, so somehow, Tcl must keep the "writable" event signaling until the "write" procedure of the refchan is actually called. The timer in UpdateInterest() is currently taking care of that. There is an opportunity to have Write() set up the timer instead, since that's the procedure that would call refchan if it a buffer was full. In that case, the timer-based "writable" event would only happen after some initial event has lead to a call to Write().


apnadkarni added on 2024-03-31 16:31:32:
Nathan,

Regarding the GUI freeze, I put it here because the root cause is the exact same. If you trace through the code you will see ChannelTimerProc queueing the timer event which gets placed on the event queue, calling ChannelTimerProc queueing the event, which gets placed on the event queue ... Event queue never gets drained so the idle queue never runs.

I'd rather not have the same discussion under 3 tickets. If you find some other root cause and I'm mistaken, you can move it to another ticket.

/Ashok

apnadkarni added on 2024-03-31 16:26:23:
@dgp, I had looked at that ticket you mentioned earlier. I think that is a much broader and trickier scope.

apnadkarni added on 2024-03-31 16:24:20:
Nathan,

I have moved your change to 14.11.0 to a mistake branch.

This is the first time I have reverted someone else's changes, sorry, but it is simply not acceptable to change an perfectly functional test so as to not exercise a bug. The changes you made are not semantically equivalent.

You have repeatedly tried to justify your design by arguing that functional, valid user code to be changed. This is another example.

/Ashok

pooryorick added on 2024-03-31 16:23:19:

Ashok, that last comment belongs in its own ticket so it can be properly diagnosed and remedied separately if possible.


dgp added on 2024-03-31 16:14:41:
This may be off-topic.  Ignore or delete if it is.

I see much activity regarding events and channels, and I wonder if
any of it is relevant to ticket

https://core.tcl-lang.org/tcl/tktview/ef28eb1f15167156c898

If there's a chance to tackle that too while folks are already in the code.

apnadkarni added on 2024-03-31 15:49:13:

As it stands, there is another problem with the current implementation. If run in wish, it freezes the GUI as the idle events never get to run. I suspect the write event keeps getting queued after an interval of SYNTHETIC_EVENT_TIME which happens to be 0 which means the event queue is never drained. Test io-44.6 which tests the implementation never actually tests that other events are not blocked.

The attached script illustrates this. This is essentially a modified version of io-44.6.

With tclsh 8.6, the time is printed every second

c:\src\tcltk\wip\tcl\tests>tclsh86t ..\tests\refchan-async-test.tcl
1711899031
1711899032
1711899033
1711899034
1711899035
^C

With 9.0 however, after the initial invocation the time is never printed

c:\src\tcltk\wip\tcl\tests>d:\tcl\90\x64-debug\bin\tclsh90.exe ..\tests\refchan-async-test.tcl
1711899100
^C

The branch apn-bug-de232b49f2 works properly

c:\src\tcltk\wip\tcl\tests>d:\tcl\wip\x64-debug\bin\tclsh90.exe ..\tests\refchan-async-test.tcl
1711899197
1711899198
1711899199
1711899200
^C

Now could one change SYNTHETIC_EVENT_TIME to something other than 0? The comment there says Must be zero or bad things tend to happen. Could it possibly do the equivalent of "after idle after 0"? I suspect this will have the same undesirable effect as changing SYNTHETIC_EVENT_TIME and anyways feels like a patch for a patch for fundamentally broken design.

The above illustrates tclsh but the same effect is seen with wish.

d:\tcl\wip\x64-debug\bin\wish90.exe ..\tests\refchan-async-test.tcl

pops up a text widget that can be edited. On the other hand, with the existing 9.0, the text widget is shown but no characters are displayed when you type and the GUI appears frozen.

This. Polling. Based. Design. For. Events. Is. Just. Plain. Wrong.

/Ashok


pooryorick added on 2024-03-31 15:45:57:

There has been an example on the wiki for 20 years already showing how to do port scanning in pure Tcl by connecting asynchronously to each port, and it wasn't broken by the fix for this issue because it never sets the channel to nonblocking: Port scanning in tcl.


pooryorick added on 2024-03-31 15:39:56:

I did address the port scanner issue by providing example code for asynchronously detecting when a port is open.


pooryorick added on 2024-03-31 15:37:56:

Also, in [1b9bd55b7af13673] I just fixed socket test 14.11.0 so that it will no longer sporadically fail on some platforms.


pooryorick added on 2024-03-31 15:23:51:

Ashok wrote, "It is about the use of writable events to asynchronously detect when a connection is open."

Again:

    set so [twapi::tls_socket -async www.example.com 443]
	    fileevent $so writable [list apply {chan {
		    #chan is now connected
		    fconfigure $so -blocking 0 
		    fileevent $so writable [list apply {chan {
			    #do stuff with connected channel
		    } $so]
	    }} $so]
	    vwait ...
    

That example asynchronously detects when a connection is open, so we're done here, right?


apnadkarni added on 2024-03-31 14:06:03:

Nathan,

Most of your post focuses on something that is not crux of the use case being discussed.

This is not about the use of writable events to ascertain when a channel can be written to. It is about the use of writable events to asynchronously detect when a connection is open.

From the socket -async docs,

Running the event loop also allows you to set up a writable channel event on the socket to get notified when the asynchronous connection has succeeded or failed.

The complaint I have is that in 9.0 the writable channel event is generated prematurely - before the connection has succeeded or failed. While I might have used the "writable event" loosely, it is in this context - signifying the connection is open.

While you airly dismiss test the long standing socket-14.11.0 as a broken test, this test is almost exactly how the simplest of network applications - port scanners - would be written ("almost" because a port scanner would not do the puts). Your "fix" makes it impossible to write a reasonable (synchronous or polling is not reasonable) one in Tcl. I do notice you did not address the port scanner example or the address logging example in your response.

And to the specifics of your comments...

...twapi related example.. - no, your code fragment, does not solve the problem. When the lambda fires, the connection is not complete. The fconfigure and fileevent following are completely irrelevant. See above. Even otherwise, the idiom of of socket -async followed by fconfigure -blocking 0 has been around forever. There has been nothing to forbid it. There is no justification for asking that to be reworked as a workaround for a buggy core change.

...Here's another example... about puts after a socket -async. Again, irrelevant. puts is not the issue. See above.

...socket-14.11.0 failures... Oh if only we could fix all test failures by disparaging the test! This is a perfectly reasonable test reflecting a real world use case. I also do not understand your purpose behind quoting the manpage flushing /closing the channel. That is completely irrelevant to this issue.

...I believe this existing fix is theoretically correct... Outside of some special cases, a timer based event generation scheme for I/O is completely pointless. It defeats the entire purpose of event based I/O. Imagine the Unix select call, which serves a similar purpose, marking descriptors as "ready" based on a timer as opposed to actually being ready.

Timer based event generation for i/o is essentially polling at regular intervals. There is no point to such an "event driven" i/o system where the "event" is a timer as opposed to i/o! An application might as well just poll itself using after!

A few years ago I submitted issue 67a5eabbd3 and fixed it Relevance? No one is disputing refchan limitations but my -buffering none branch does not break that issue either. I could not find a relevant test case but I manually tested the script in that issue.

I'm not in favour of changing the default buffering for refchans. It would help if you pointed out the technical issues or negative practical impact as to why. I've already stated in my earlier post that fixing refchans would be preferable but I do not think that is feasible in a reasonable time frame. An opinion that "it is a knee jerk reaction" is neither here or there. Your reference to the 080f846fd58958ef is a red herring because you did not address the actual issue being raised there.


pooryorick added on 2024-03-31 01:12:56:

Yes, I incorrectly claimed that [chan configure ... -blocking] was redundant with [socket -async]. I forgot that [socket -async] returned a blocking channel. However, the fact that [socket -async] returns a blocking channel provides the solution to [080f846fd58958ef]: Don't immediately configure the channel to be nonblocking, and instead do this:

	set so [twapi::tls_socket -async www.example.com 443]
	fileevent $so writable [list apply {chan {
		#chan is now connected
		fconfigure $so -blocking 0 
		fileevent $so writable [list apply {chan {
			#do stuff with connected channel
		} $so]
	}} $so]
	vwait ...

Problem solved: The event will be triggered only when the channel is connected. Now, on to the issue of "writable" events on non-blocking channels. If a channel exists and is nonblocking, it's writable. Any other state of the channel, whether it is error state or connection state is not a factor. That's how nonblocking channels have always functioned in Tcl, and if you think about it, that's the way it must be: If something is blocking the writability of a channel, then it's no-longer nonblocking. That's why it has always been the case that for a nonblocking channel [puts] never fails, and also why "Tcl continues to buffer the data and writes it in the background as fast as the underlying file or device can accept it." ([puts] documentation), and why [close] is the operation that reports the error, if there was one: There's no other strategy that is practical.

Here's another example:

	set chan [socket -async <some address>]
	chan configure $chan -blocking 0
	puts $chan hello

Here, it's entirely possible that the socket will not be connected yet, but when [puts] executes it will never fail, nor should it. That's the point of configuring a channel to be nonblocking: It doesn't ever block. By definition, if a channel never blocks, it is always writable. Over in [080f846fd58958ef] Ashok wrote, "Under no circumstances should a file event writable callback be invoked unless the channel is actually open (writable) and the above does not make that check." As we see in the example above, that's wrong: [puts] doesn't fail, and if it doesn't fail, then the channel was writable. If the channel is writable, there's no problem triggering a corresponding event on the the channel. Not only that, but the fire-and-forget aspect of nonblocking channels is very convenient, particularly at the script level.

I incorrectly claimed that the test 14.11.0 failure we were talking about was the case where it produces an error. The actual situation is that the test fails if an error does not occur, but I believe it is by design that sometimes there is no error. The documentation for [close] states,

"If the channel is nonblocking and there is unflushed output, the channel remains open and the command returns immediately; output will be flushed in the background and the channel will be closed when all the flushing is complete."

That means that for systems that take their sweet time figuring that a connection can't be completed, [close] will in the meantime return without error, which in turn means that test 14.11.0 is only testing circumstance rather than anything meaningful. Perhaps that test should be deleted, or perhaps a pause should be introduced to let the system finish meditating on the state of the connection.

I believe that the existing fix for this issue is theoretically correct, and should remain in place. A few years ago I submitted issue [67a5eabbd3] and fixed it. Prior to this, refchans were thoroughly broken in an subtle way, and I believe this caused refchan programmers over the years to throw up their hands and just walk away from refchans, or to invent twisted cargo-cult approaches to keeping their refchans responsive. That issue is related to this one. Not taking the approach that has now been established for triggering writable events on asynchronous channels risks making them unusable again in ways that are too subtle for most programmers to ascertain, and in my experience means that we'll just keep playing whack-a-mole with issues related to handlers not ever getting triggered even though the should be.

I'm not in favour of changing the default buffering for refchans. That would be a knee-jerk reaction to a particular problem rather than a well-considered design, and in issue [080f846fd58958ef] there is so far no actual example of a broken script that would warrant such a change.


jan.nijtmans added on 2024-03-30 16:52:11:

> Jan, I do not think you have fully understood the issue

Indeed, I don't understand 100% of all details, but globally, yes I do.

Switching buffering off for refchan's by default sounds OK to me: Buffering is meant to assure that the channel can continue operating if the next step is blocked. So, in a chain, if one of the elements has buffering on, all the other channels are able to operate through that.

Also, delaying the timer even until after the channel is fully open sounds OK to me. But it would be good to assure it doesn't break any outside channel implementation. Better be a little bit too careful here.....

Anyway: discussion about this is a good thing happening! You are bringing in a lot of arguments and examples! That's constructive, and - hopefully - helps finding a solution.


apnadkarni added on 2024-03-30 15:59:38:
With respect to side effects with my branch setting refchan buffering to none by default, I have not seen any after experimenting for a while. All tests pass on Win and Linux and for the most obvious consequence - performance - there is no discernible difference:

With current trunk:

```
% timerate -calibrate {}
0.08983439299958879 µs/#-overhead 0.090688 µs/# 21524429 # 11026859 #/sec
% set fd [tcl::chan::memchan]
rc0
% timerate {puts -nonewline $fd x}
7.680620 µs/# 128693 # 130197 #/sec 988.442 net-ms
% seek $fd 0
% string length [read $fd]
128693
```

With the -buffering none based solution

```
% timerate -calibrate {}
0.09071036436181426 µs/#-overhead 0.091423 µs/# 21351367 # 10938200 #/sec
% set fd [tcl::chan::memchan]
rc0
% timerate {puts -nonewline $fd x}
7.687302 µs/# 128568 # 130084 #/sec 988.341 net-ms
% string length [read $fd]
0
% seek $fd 0
% string length [read $fd]
128568
```

I suppose one might expect this. Buffering is really meant to mitigate the cost of disk i/o and system call transitions. There is none with reflected channels.

An alternative, but similar effect, would be to set buffersize to 1 but the related code is a little more convoluted and I don't think offers any benefit over -buffering none.

I'm inclined to favor the -buffering none solution. It handles not the socket test and Nathan's coro test. Docs will have to note refchans do not get buffering by default. Also, we would have to decided whether the application can change the buffering mode (I would say yes as long as the potential issues are noted in the documentation. If that is not desired, we could simply force the buffering to be -none and ignore / raise error on any requests for change).

apnadkarni added on 2024-03-30 15:38:55:

While we are on this code, another doubt regarding the surrounding code in question comes to mind. In 8.6, the code fragment that checks for read event generation was always executed (assuming the appropriate conditions were met). In 9.0, if the new write event checks (that we are discussing) match, the read side is not checked at all. Is this intended? Not directly relevant to current discussion but I'm curious to know the reasoning.

Code in question is https://core.tcl-lang.org/tcl/file?ci=8a8722e215ef8d63&name=generic/tclIO.c&ln=8761-8793


apnadkarni added on 2024-03-30 12:02:18:

A channel writable event is being generated by a piece of code on a periodic basis with no knowledge of the channel state and whether it is actually writable or not. This is so broken as to be self-evident in my opinion and should need no debate but to answer Nathan's points below.

The failure of socket-14.11.0, marking a connection when it is not yet open, or is not a trivial matter. That specific scenario is exactly one that might be used to implement a port scanner (which always operate asynchronously for obvious reasons). The current trunk implementation will show all ports open even when they are not! And note the script is not aware the channel is broken as you seem to think!

Consider the complementary case where they are open. The event fires BEFORE the TCP connection actually completes, the write handler runs and writes connection details to syslog. Not an uncommon scenario but the fconfigure that retrieves address/port information returns empty strings because the connection has not completed making the logging useless.

A third example - admittedly a code inspection based conjecture; I have not tried this as I don't have a multihomed machine. Tcl's socket implementation is smart enough that if the destination hosts resolves to multiple addresses, it will try each in turn until one succeeds. The write event will trigger on the first attempted connection which may actually fail when a subsequent candidate succeeds.

Now with respect to the -async option, let's ignore twapi because it simply clones Tcl socket semantics. The two illustrative fragments you gave are very different semantics. When the -async option is specified, the command returns without waiting for the TCP connect to be completed and connection open. The application can continue other work until the connection is open. Without the -async option, the application is blocked from any other work until the connection is open, which may be several or even tens of milliseconds. Changing socket with fconfigure -blocking 0 cannot change this (in the second example!). -async does not make the socket non-blocking, and -blocking 0 does not make the connection asynchronous (how can it when it can only be done afterward!). So I completely fail to understand your comment of redundancy.

The fundamental problem here is that the reflected channel framework has incomplete support when it comes to event driven I/O. All channel drivers I have written or seen have two components. The first is the channel driver itself - in this case the tclrchannel driver. The second component of the driver hooks into Tcl event processing via the Tcl_EventSetupProc / Tcl_EventCheckProc callbacks. These are where the channel implementations generate notification events but the refchan framework has equivalent. The correct fix would be to add the equivalent facilities to refchan, or alternatively enhance the channel driver API (and refchan) to be able to exchange state information that would let the common channel code make proper decisions that depend on channel state.

Both are likely to be non-trivial and tricky to get right, possibly why they were never implemented, and probably not feasible in the 9.0 time frame.

As an alternative solution, I propose turning off buffering for refchans (either via an internal -buffering none or a -buffersize 1). This forces calls to refchans always to be passed down to the refchan script which is then responsible for generating events exactly as Nathan's test code does. This experiment is in branch apn-bug-de232b49f2. There are some side effects that I will detail later but for now, all tests pass (both socket-14.11.0 as well as io-44.6 which tests Nathan's coroutine based driver).


pooryorick added on 2024-03-29 21:51:53:

The refchan issue described here is far more serious than the failure of test socket-14.11.0, because in a situation like the one illustrated by that test, at least the script becomes aware that the channel is broken, and has the opportunity to respond somehow. In contrast, any script that falls victim to the refchan issue described in this ticket, it has no recourse, and simply fails silently and irretrievably.

There are many ways in which channel can become broken, and a socket failure to connect is for practical purposes not much different than any of those other failure modes. In my opinion, "async" should mean async from start to finish, and the script itself can decide how to handle the various possible failures in an asynchronous way.

In the meantime there are ways to avoid the problem Ashok illustrated with twapi in [080f846fd5]. For one thing, the example is a little strange:

      set so [twapi::tls_socket -async www.example.com 443]
      fconfigure $so -blocking 0
  

The socket is opened in asynchronous mode, but the for some reason it is still necessary to set the blocking mode of the resulting channel to 0. Why the redundancy? I would guess that if the "-async" argument were removed, then twapi would wait for the connection to complete, and then the rest of the code would continue to function asynchronously:

      set so [twapi::tls_socket www.example.com 443]
      fconfigure $so -blocking 0
  

The same is true for socket-14.11.0. The documentation for the "-async" option to [socket] has long described the possibility: "This means that the socket will be created immediately but may not yet be connected to the server, when the call to socket returns."

So even just following the documentation will avoid the issue Ashok has raised. I would not revert this fix. There may well be a more satisfying fix, and if it is found, it can of course replace this one.


jan.nijtmans added on 2024-03-29 20:46:23:

@nathan, can you repond? Would it be best to revert this 'fix', and find a better one?


apnadkarni added on 2024-03-29 05:40:37:

Marking this as not fixed as I believe the "cure" is worse than the disease. The change results in bug 080f846fd5. See that ticket for details but in a nutshell generating "writable" events without ascertaining that the channel is in fact open is fundamentally wrong.

Chasing that bug also led me to (finally) chase down the cause of the socket-14.11.0 test failure (that Don reported subsequent to this commit which is completely reproducible on Windows and something that I've been meaning to track down for a while) which led back to exactly this again. The write handler is invoked before the channel is actually open. This should never be the case.

Some other solution to the original failure in this ticket is needed other than generating writable events on a timer. At the very least some additional state checks that the channel is open is needed.


dgp added on 2021-05-27 15:19:23:
I still frequently (though not always) see test socket-14.11.0 fail.

pooryorick added on 2021-05-27 06:29:04:
The issues mentioned may have been fixed by subsequet work.  Are there any outstanding issues related to this ticket, or can it be closed now?

dgp added on 2019-05-06 13:50:46:
Failures of chan-io-44.6 may be arising here too.

dgp added on 2019-05-06 13:17:19:
The test socket-14.11.0 started failing with this change. Should we resolve
here or I open a new ticket?

pooryorick added on 2019-05-01 11:00:15:

Further explanation: The documentation for [chan event] states that a channel is considered to be writable if at least one byte of data can be written to the underlying file or device without blocking, or if an error condition is present on the underlying file or device. However, for better performance, Tcl produces writable events until the internal buffer is flushed. This fix extends this behaviour to reference channels.


pooryorick added on 2019-04-27 07:24:06:

Fixed in [d0dd6d19a41ae7af].


Attachments: