Tcl Source Code

View Ticket
Login
Ticket UUID: de232b49f26da1c18e07513d4c7caa203cd27910
Title: write-only nonblocking refchan and Tcl internal buffers
Type: Bug Version:
Submitter: pooryorick Created on: 2019-04-26 20:53:55
Subsystem: 04. Async Events Assigned To: pooryorick
Priority: 5 Medium Severity: Important
Status: Open Last Modified: 2024-04-09 21:24:43
Resolution: None Closed By: nobody
    Closed on: 2019-05-01 11:00:15
Description: (text/x-fossil-wiki)
Given the following write-only refchan, the coroutine attempting to use such a
channel yields once and writes to the channel once.  After the coroutine yields
a second time, the channel never becomes writable again and the coroutine is
never entered again:


<code><verbatim>
namespace eval refchan {
    namespace ensemble create
    namespace export *


    proc finalize {chan args} {
        namespace delete c_$chan
    }


    proc initialize {chan args} {
        namespace eval c_$chan {}
        namespace upvar c_$chan watching watching
        set watching {}
        list finalize initialize seek watch write
    }


    proc watch {chan args} {
        namespace upvar c_$chan watching watching
        foreach arg $args {
            switch $arg {
                write {
                    if {$arg ni $watching} {
                        lappend watching $arg
                    }
                    chan postevent $chan $arg
                }
            }
        }
    }


    proc write {chan args} {
        chan postevent $chan write
        return 1
    }
}


after 0  [list coroutine c1 apply [list {} {
    set chan [chan create write refchan]
    chan configure $chan -blocking 0
    chan event $chan writable [list [info coroutine]]
    for {set i 0} {$i < 10} {incr i} {
        yield
        puts $chan hello
    }
    set ::done 1
}]]
vwait done
</verbatim></code>


The problem is that <code>tclIO.c/Write</code> adds the value from the first
write into an internal buffer but takes no actions that result in another call
to <code>Tcl_NotifyChannel</code>.  In particular, the "write" command of the
reference channel is never called because the result of
<code>IsBufferFull</code> is "false".
User Comments: pooryorick added on 2024-04-09 21:24:43: (text/x-fossil-wiki)
Ashok wrote, "A change to preallocate channel
buffers, for example, would break this 'guess'."

No, it wouldn't, because the check is that the buffer is not empty (i.e. there has already been a write performed), and not full (i.e. and the write has not been flushed).

pooryorick added on 2024-04-09 20:15:58: (text/x-fossil-wiki)
Ashok, is io-14.11.0 still failing on your system with [70a407c2453e5aa3] (which is the fix for [080f846fd58958ef])?

apnadkarni added on 2024-04-08 15:10:50: (text/x-markdown)
Nathan,

Starting with your last comment,

*I think one thing we can all agree on is that if a device signals that its
ready for a write...*

To expand on that, I think we can all agree none of the current proposals are
ideal. My concern with a broader, "better" solution has always been one of
timing w.r.t 9.0 release. I have reiterated my concerns with the current
implementation often enough. Alternative (1) below seems less risky in
comparison to a more complete solution because of 8.6 history (notwithstanding
any the other refchan bugs, restricting discussion to the issue in this ticket).
Similarly, alternative (2) below seems less risky because the channel subsystem
must in any case be presumed to work correctly with -buffering none. If it does
not, that is something to be fixed irrespective of refchans or anything else. If
someone has the time to work through the details, and sufficient review, a more
complete solution is certainly something desirable. I just do not want to hold
up 9.0 for this ideal scenario. If you take a stab at this (I'm still
prioritizing getting my extensions and distributions 9.0-ready), it would
certainly be a worthwhile exercise.

As to the other points,

Nathan wrote *If you can eliminate that timer without breaking anything, than
please do! What I'm arguing against is just reverting this change without
providing any other fix, and without even providing an example of something that
is broken and can't be fixed by any of the existing knobs.*

I already suggested two solutions

1. For the refchan to generate the event (off a timer). Why is this better than
the way it is done currently within the channel core? Because (a) the refchan
has better information about connection state, and (b) it only affects refchans
where this is needed, not ALL channels (refchans or not).

2. Implicitly setting -buffering none on refchans.

I already stated neither is ideal, I prefer (1), but both are better than what we
currently have.

I am waiting for you to specifically state what breaks with either of the above
or at least some technical reasons in opposition. The two objections I have heard
are:

- *That would be a knee-jerk reaction to a particular problem rather than a
well-considered design*.

- Paraphrasing my understanding of you stated, *Scripts should not be burdened
with the responsibility*.

With regards to the first, it's difficult for me to translate an opinion to a
broken behaviour or technical objection I can rationally discuss.

With regard to the second, if the refchan is implemented in
scripts, only the script knows channel state, and **must** own the
responsibility of generating events. Even with refchan enhancements, this would
still be true; the core could generate events but the information about state
would still have to be provided by the refchan through an enhanced API.

Moreover, your comments on how twapi or 14.11 or port scanners user code should
be written actually **burdens scripts with changing valid code to workaround the
bugs introduced by an attempt to reduce their burden**! The irony!

Nathan wrote *The problem is that Tcl writes to a buffer without actually
calling the [write] procedure that a refchan provides.*

Yes, and -buffering none on refchans, implicitly or explicitly, forces that
write, thereby resolving the issue for refchans!

Nathan wrote: *No one has made test io-14.11.0 reliably fail in the error case,
so it's currently not a meaningful test.*

I've already commented on this "meaningful" criterion for tests in
[b7f4b5a3f4](https://core.tcl-lang.org/tcl/info/b7f4b5a3f4a34b37). If we
have different opinions of what a meaningful test is, so be it. I will
give another example of where socket connection open checks are important
**before** doing any I/O on the channel (as in some of your examples that
do puts). Applications, for example, may want to protect against DNS spoofing
with reverse DNS lookups. This must be done before **any* io on the channel,
including output. When the connection open notification is received, the
peer address must be available. This will not work with premature connection
opens. (I understand you proposed buffer checks may fix this premature
notification problem but want to insist 14.11 not be dismissed as meaningless
in its intent or its implementation.)

Now coming to those additional proposed buffer checks,

Nathan wrote *The timer can be constrained to cases where a write (perhaps only
to a buffer) has already occurred and an unfilled buffer exists.*

I saw that in commit
[70a407c245](https://core.tcl-lang.org/tcl/info/70a407c2453e5aa3). (I missed the
comment in some other ticket; one consequence of discussion being spread across
three tickets). I'm afraid this only illustrates the problem with the current
design. It forces the channel core to **attempt to deduce information it does
not have**. Here it is attempting to figure out channel state based on buffer
state. Not only is this just a "guess", you have now made event generation a
dependency on the buffer management code. A change to preallocate channel
buffers, for example, would break this "guess". In SE terms, high coupling
between components that are conceptually independent. This is not good design
for maintainability.

pooryorick added on 2024-04-03 19:47:55: (text/x-fossil-wiki)
You mean this fix that has been in place since 2019?  Not unless this failure has been occurring for the last 4 years.

jan.nijtmans added on 2024-04-03 09:21:14: (text/x-fossil-wiki)
I'm seeing the http-3.26.1 testcase fail sometimes on MacOS. Could this 'fix' be the cause of that?

See: [https://github.com/tcltk/tcl/actions/runs/8534740843/job/23379729000]

pooryorick added on 2024-04-02 12:36:43: (text/x-fossil-wiki)
Adjusting the timer like that is one way to achieve rate control, but
Ashok is arguing that rate control should be achieved by waiting for the
device to signal that it's ready for more writes.  The problem is that
Tcl writes to a buffer without actually calling the <code>[write]</code>
procedure that a refchan provides.  The <code>[write]</code> procedure
is the logical (and only) place for a refchan to signal that its ready
for another write more, but because of this "write to a buffer instead
of performing a write to the device" phenomenon, refchan must
continuously signal that it's ready for a write, which means setting up
a repeatedly-scheduled event.

I think one thing we can all agree on is that if a device signals that
its ready for a write, and Tcl writes to a buffer instead of actually
performing a write, then Tcl can assume that the device is still ready
for a write, and can safely generate a "writable" event again.  We
should be able to develop an improved approach based on this assumption,
and the new approach will eliminate the need for refchans to set up such
repeatedly-scheduled events, and probably also eliminate the need for
that <code>CreateChannelScheduledEvent()<code> function that sebres just
pointed to.

sebres added on 2024-04-02 11:03:00: (text/x-fossil-wiki)
Just few cents from my side about SYNTHETIC_TIMER - in my event-perf-branch from RFE [fdfbd5e10], I rewrote this synthetic timers with a [/artifact?name=5c08fdd1fd1a4d32&ln=8406-8419|scheduled event] which can be retarded inside the handler if it is not yet ready.

pooryorick added on 2024-04-02 10:26:35: (text/x-fossil-wiki)
TLDR:  If you can find an acceptable way to eliminate that timer, then
I'm all for it.  Just don't revert this change wholesale.  The situation
calls for more effort than that.

Next...

I understand all of the principles that you've explained, and in theory
agree.  The thing is that Tcl provides a variety of knobs to achieve
various behaviours, and also provides refchans, which must function even
in the absence of external state changes that typical channels have.
You're arguing from a matter or principle, but the trick is to
accommodate all of these behaviours and contexts in practice.  If you
can eliminate that timer without breaking anything, than please do!
What I'm arguing against is just reverting this change without providing
any other fix, and without even providing an example of something that
is broken and can't be fixed by any of the existing knobs.  I have shown
how to asynchronously connected by using <code>[socket -async]</code>
and not configuring the channel as nonblocking until the "connected"
event is handled. No one has made test io-14.11.0 reliably fail in the
error case, so it's currently not a meaningful test.  The idle queue
starvation you mentioned was easily fixable separately.  Until there is
a concrete example showing something that needs fixing, the best way to
continue is for any interested party to make improvements to existing
code, and not to revert the change that fixed this issue.

I'm not fond of that timer either, and I've even provided an idea for
taking steps towards eliminating it.  Perhaps that got lost in the
noise:  The timer can be constrained to cases where a write (perhaps
only to a buffer) has already occurred and an unfilled buffer exists.
This would move things in the direction you're advocating by only
employing the timer when it is really necessary.  If you have time, you
could do that work, or I'll do it when I have time, or maybe someone
will come up with a way to eliminate the timer entirely.  Let's work on
it, and not just revert changes that have been in place for four years
already, and that do solve issues like the one described in the current
report.

All a refchan should have to do is use [chan postevent] to indicate that
it is writable.  Yes, refchans currently set up timers keep those
writable events posting, but as you've just elegantly explained, that's
a bad hack that violates the principles of asynchronous channels.  So
let's not enshrine that in the design, but instead get to work on the
internals so that things remain simple at the script level, and we can
continue to improve the situation in the implementation as intelligence
and time resources become available.

apnadkarni added on 2024-04-02 07:35:31: (text/x-markdown)
Nathan,

I'll make one last attempt at persuasion that

- from an architectural point of view, generating I/O callback events without
knowing channel state is not at all an appropriate model for async event driven
I/O.

- The related changes in trunk vis-a-vis 8.6 are broken in multiple aspects,
mostly because of the above.

The motivation for async i/o is that applications can do useful work while
waiting for I/O to complete. Now this by itself does not need select on Unix,
completion ports on Windows or fileevent in Tcl. One can just call the i/o
functions and check for "EAGAIN" equivalents. There are multiple problems with
this as the application has to poll:

- Try too frequently and its wasting processing cycles processing EAGAINS.

- Try too infrequently and there is an unnecessary delay / latency in servicing
I/O requests.

An event system solves both the above **as long as events are generated based on
the I/O state.** No time is wasted in unnecessary polls and there is not
additional delay once the channel is ready for I/O. However, if I/O events are
generated based on timers **with no knowledge of I/O state** it has exactly the
same effect as application polling! It is completely pointless - generating
events when I/O state does not reflect channel readiness, and unnecessary delays
after readiness before the timer expires. If you truly believe this as a
solution, you should be amenable to completely getting rid of Tcl's I/O related
event subsystem! The application could just generate write events using `chan
postevent` on a regular basis with `after`. This is basically what the current
Tcl 9 implementation does, queueing events on a timer basis.

The above is the motivation for async/event-driven operation from an
efficiency and performance perspective. However, there is also the "simple"
matter of correctness. An event should not be generated prematurely
reflecting a state that does not actually exist.

In other words, from both the performance and correctness point of view, the
current timer based write event generation in ChannelTimerProc **which is not
based on channel state** is fundamentally flawed. It does not fulfil the
intended purpose of an event based i/o system by essentially polling, and
moreover does not meet correctness criteria as it has no idea of channel state
and generates events prematurely (14.11 failure).

All the above is a comment on event driven i/o and channels in general. Now as
far as refchans are concerned, there is a limitation in the refchan framework as
mentioned in an earlier post which led to the original defect you logged in this
ticket. To reiterate, there is no script level equivalent of the
`Tcl_EventSetupProc` and `Tcl_EventCheckProc`. Thus *some*, not all, refchans
are forced to use a timer base scheme to generate these events. However, **this
must be done by the refchan script implementation itself and not the generic
channel infrastructure** because the former knows the channel state, the latter
does not. This is still not ideal from the efficiency perspective, but being
able to check state, it is at least correct.

I believe that is how Andreas' virtchannel modules in tcllib work. And as proof
of concept, following the tcllib model, I've modified your refchan
implementation of 44.6 and attached as refchan-async-redux.tcl (proof of concept
only modeling tcllib). This works in Tcl 8.6 as well (which your version did
not). TL;DR the changes you made to the core in ChannelTimerProc (a) lead to at
least two bugs 11.14 and event q starvation, (b) affected channels other than
refchans, (c) were unnecessary.

Given (imho) the enhancement of the refchan framework as too much of a risk for
a 9.0 release, I see two possibilities that are acceptable (not perfect, just
acceptable) for 9.0 release:

- revert the implementation to what 8.6 does. No need for -buffering none in
this case but the script level refchan implementation has to generate timer
events, **check state** and then do a `after idle after 0 chan postevent` from
the timer callback. See tcllib or attached sample. Alternatively, the refchan
can do the -buffering none itself and avoid the timer if that suits its purpose.

- If you do not want the channel script implementation to have that responsibility,
(I would like to know why not) then set -buffering none for refchans as
proposed in my branch.

I prefer (1).

I am pretty much going to stay silent on this topic now. I cannot provide any
more clarity on my objections. Finally, some group of people has to decide
on a course of action. Hopefully, that group is not just you and me.

apnadkarni added on 2024-04-02 03:16:12: (text/x-markdown)
Nathan,

*I've created a new ticket, [ac7592e73c10d04b], for the idle queue starvation that Ashok provided an example of, and also fixed the issue by having tclIO.c/ChannelTimerProc() us the idle queue instead of the standard queue to reschedule itself.*

The following concerns come to mind with this change:

- Changing to the idle queue would seem to have similar effects to changing the SYNTHETIC_TIMER from 0 to a positive number. The comments for that #define specifically warn against that.

- Posting to the idle queue when other file events are posted to the event queue seems like it could lead to the reverse situation where your posted events are starved.

- Given that event processing in `Tcl_DoOneEvent` and its ilk allows filtering based on flags `TCL_FILE_EVENTS|TCL_TIMER_EVENTS|TCL_IDLE_EVENTS` this further confuses what those flags mean. Will your events fire in a TCL_FILE_EVENTS loop?

However, I am not going to spend time chasing the above when I believe the fundamental design itself is flawed, irrespective of whether you use the timer queue or the idle queue.

apnadkarni added on 2024-04-02 02:49:54: (text/x-markdown)
Nathan,

The reason I backed out your test case modifications were differing semantics. You added an additional vwait command in effect creating a delay that allows the connection success/failure to happen in time! I do think the original test could have been better targeted because the puts is irrelevant to the actual bug. I may add a separate test case in my branch to illustrate the actual connection failure but don't want to muddy the waters here.

And regarding your comment -

*If you want to prove your point, and also show that you're not just holding my changes hostage to win this argument, you'll make if fail consistently on all platforms so that it has meaning as a test.*

Some test cases, by their nature, cannot be made to fail reliably on every
platform every time. This can be either because the failure is timing dependent,
depends on the libraries in use, operating system versions etc. Examples include
failures caused by race conditions in multithreaded cases - it may take millions
of iterations to cause a failure, may depend on processor speed, number of
processors etc. So also, 14.11 is timing dependent. There is effectively a race
between the connection completion and delivery of the write event.

FWIW, on my system 14.11 fails every single time. I do not know if that
reproducibility is specific to Windows or to my system. Don indicated it is
sporadic on his (non-Windows) system which is an indication it is timing
related.

So yes, while ideally failures would be consistently reproducible, across all
platforms all the time, that is not always possible. Does not make the test
invalid.

And in cases like this, where the cause of the failure is ascertained to be the code being tested, all the more incentive to keep the test.

Think of test failures analogous to the user reporting a crash. I suppose your response would be "Oh unless you can show that happens on all systems, it is not a bug"!

pooryorick added on 2024-04-01 21:16:02: (text/x-fossil-wiki)
Given that <code>[socket -async]</code> returns a blocking channel, it's more likely that the designers intended it to remain blocking until the connection is confirmed to be established.

pooryorick added on 2024-04-01 21:13:15: (text/x-fossil-wiki)
The examples on the wiki were written over the course of the last 20 years, so I don't know which "bug" you're talking about, given that the changes Ashok and I are debating were introduced more recently.

jan.nijtmans added on 2024-04-01 20:47:32: (text/x-fossil-wiki)
> Almost all of the examples of [socket -async] on the wiki set the channel to nonblocking only after the connection is established. Seems like historically, people understood the right way to use [socket -async].

It's more likely that many people found out that setting [socket -async] from the start doesn't work right, because the Tcl implementation is buggy. Why setting up the channel cannot be asynchronious? Let's fix that!

pooryorick added on 2024-04-01 20:33:30: (text/x-fossil-wiki)
If 14.11.0 was intentionally written to test this particular thing, then
why is it only sporadically failing now, and only on some platforms?
Also, why do the other tests that use the same pattern not fail then?
Written as they are, only to a some-value-of-"works" standard, these
tests are at least a little bit suspect.  Almost all of the examples of
<code>[socket -async]</code> on the wiki set the channel to nonblocking
only after the connection is established.  Seems like historically,
people understood the right way to use <code>[socket -async]</code>. 

Regarding the fix for [ac7592e73c10d04b], care to explain how it "hides
the real problem"?  I think the fix gets a red herring out of the way so
we don't waste time discussing it.  The timer itself is debatable, but
we might as well debate a correct implementation of it.

jan.nijtmans added on 2024-04-01 18:19:25:
> Ashok, since you undid my change to socket test 14.11.0 ...

Ashok's action was IMHO fully justified. @nathan, please don't change a testcase which is written as-such on purpose. Ashok is not holding your changes hostage to win the argument: the goal here is to come up with a _real_ solution.

About the change from timer event to idle event: It's - indeed - actually the same, but it also hides the real problem. Not a good change to make directly on trunk, while discussion is ongoing. It's also not bad enough to start a 'commit war'.

pooryorick added on 2024-04-01 00:29:51: (text/x-fossil-wiki)
I've created a new ticket, [ac7592e73c10d04b], for the idle queue starvation that Ashok provided an example of, and also fixed the issue by having tclIO.c/ChannelTimerProc() us the idle queue instead of the standard queue to reschedule itself.

pooryorick added on 2024-03-31 17:39:07: (text/x-fossil-wiki)
Ashok, since you undid my change to socket test 14.11.0, it is now back to failing sporadically.  If you want to prove your point, and also show that you're not just holding my changes hostage to win this argument, you'll make if fail consistently on all platforms so that it has meaning as a test.

pooryorick added on 2024-03-31 17:03:58: (text/x-fossil-wiki)
I don't see anything wrong with delaying the writable event notification
until a connection is established, but that should be done in such a way
that this issue remains fixed.  I have an idea for that:  The issue here
is that "writable" event notice happens once, then a "writable" event
happens once but then never again because Tcl writes to a buffer instead
of calling the "write" procedure for the refChan.  I think it's pretty
clear that refchan implementations shouldn't be burdened with the duty
of maintaining timers, so somehow, Tcl must keep the "writable" event
signaling until the "write" procedure of the refchan is actually called.
The timer in UpdateInterest() is currently taking care of that.  There
is an opportunity to have Write() set up the timer instead, since that's
the procedure that would call refchan if it a buffer was full.  In that
case, the timer-based "writable" event would only happen after some
initial event has lead to a call to Write().

apnadkarni added on 2024-03-31 16:31:32:
Nathan,

Regarding the GUI freeze, I put it here because the root cause is the exact same. If you trace through the code you will see ChannelTimerProc queueing the timer event which gets placed on the event queue, calling ChannelTimerProc queueing the event, which gets placed on the event queue ... Event queue never gets drained so the idle queue never runs.

I'd rather not have the same discussion under 3 tickets. If you find some other root cause and I'm mistaken, you can move it to another ticket.

/Ashok

apnadkarni added on 2024-03-31 16:26:23:
@dgp, I had looked at that ticket you mentioned earlier. I think that is a much broader and trickier scope.

apnadkarni added on 2024-03-31 16:24:20:
Nathan,

I have moved your change to 14.11.0 to a mistake branch.

This is the first time I have reverted someone else's changes, sorry, but it is simply not acceptable to change an perfectly functional test so as to not exercise a bug. The changes you made are not semantically equivalent.

You have repeatedly tried to justify your design by arguing that functional, valid user code to be changed. This is another example.

/Ashok

pooryorick added on 2024-03-31 16:23:19: (text/x-fossil-wiki)
Ashok, that last comment belongs in its own ticket so it can be properly diagnosed and remedied separately if possible.

dgp added on 2024-03-31 16:14:41:
This may be off-topic.  Ignore or delete if it is.

I see much activity regarding events and channels, and I wonder if
any of it is relevant to ticket

https://core.tcl-lang.org/tcl/tktview/ef28eb1f15167156c898

If there's a chance to tackle that too while folks are already in the code.

apnadkarni added on 2024-03-31 15:49:13: (text/x-markdown)
As it stands, there is another problem with the current implementation. If run in wish, it freezes the GUI as the idle events never get to run. I suspect the write event keeps getting queued after an interval of SYNTHETIC_EVENT_TIME which happens to be 0 which means the event queue is never drained. Test io-44.6 which tests the implementation never actually tests that other events are not blocked.

The attached script illustrates this. This is essentially a modified version of io-44.6.

With tclsh 8.6, the time is printed every second

```
c:\src\tcltk\wip\tcl\tests>tclsh86t ..\tests\refchan-async-test.tcl
1711899031
1711899032
1711899033
1711899034
1711899035
^C
```

With 9.0 however, after the initial invocation the time is never printed

```
c:\src\tcltk\wip\tcl\tests>d:\tcl\90\x64-debug\bin\tclsh90.exe ..\tests\refchan-async-test.tcl
1711899100
^C
```

The branch apn-bug-de232b49f2 works properly

```
c:\src\tcltk\wip\tcl\tests>d:\tcl\wip\x64-debug\bin\tclsh90.exe ..\tests\refchan-async-test.tcl
1711899197
1711899198
1711899199
1711899200
^C
```

Now could one change SYNTHETIC_EVENT_TIME to something other than 0? The comment there says *Must be zero or bad things tend to happen*. Could it possibly do the equivalent of "after idle after 0"? I suspect this will have the same undesirable effect as changing SYNTHETIC_EVENT_TIME and anyways feels like a patch for a patch for fundamentally broken design.

The above illustrates tclsh but the same effect is seen with wish.

```
d:\tcl\wip\x64-debug\bin\wish90.exe ..\tests\refchan-async-test.tcl
```

pops up a text widget that can be edited. On the other hand, with the existing 9.0, the text widget is shown but no characters are displayed when you type and the GUI appears frozen.

This. Polling. Based. Design. For. Events. Is. Just. Plain. Wrong.

/Ashok

pooryorick added on 2024-03-31 15:45:57: (text/x-fossil-wiki)
There has been an example on the wiki for 20 years already showing how to do port scanning in pure Tcl by connecting asynchronously to each port, and it wasn't broken by the fix for this issue because it never sets the channel to nonblocking: [https://wiki.tcl-lang.org/page/Port+scanning+in+tcl|Port scanning in tcl].

pooryorick added on 2024-03-31 15:39:56: (text/x-fossil-wiki)
I did address the port scanner issue by providing example code for asynchronously detecting when a port is open.

pooryorick added on 2024-03-31 15:37:56: (text/x-fossil-wiki)
Also, in [1b9bd55b7af13673] I just fixed socket test 14.11.0 so that it will no longer sporadically fail on some platforms.

pooryorick added on 2024-03-31 15:23:51: (text/x-fossil-wiki)
Ashok wrote, "It is about the use of writable events to asynchronously detect
when a connection is open."

Again:

    <code><verbatim>
    set so [twapi::tls_socket -async www.example.com 443]
	    fileevent $so writable [list apply {chan {
		    #chan is now connected
		    fconfigure $so -blocking 0 
		    fileevent $so writable [list apply {chan {
			    #do stuff with connected channel
		    } $so]
	    }} $so]
	    vwait ...
    </verbatim></code>


That example asynchronously detects when a connection is open, so we're done
here, right?

apnadkarni added on 2024-03-31 14:06:03: (text/x-markdown)
Nathan,

Most of your post focuses on something that is not crux of the use case being
discussed. 

This is not about the use of writable events to ascertain when a channel can be
written to. It is about the use of writable events **to asynchronously detect
when a connection is open**. 

From the socket -async docs,

*Running the event loop also allows you to set up a writable channel event on the
socket to get notified when the asynchronous connection has succeeded or failed.*

The complaint I have is that in 9.0 the writable channel event is generated
prematurely - **before** the connection has succeeded or failed. While I might
have used the "writable event" loosely, it is in this context - signifying the
connection is open.

While you airly dismiss test the long standing socket-14.11.0 as a broken test,
this test is almost exactly how the simplest of network applications - port
scanners - would be written ("almost" because a port scanner would not do the
puts). Your "fix" makes it impossible to write a reasonable (synchronous or
polling is not reasonable) one in Tcl. I do notice you did not address the port
scanner example or the address logging example in your response.

And to the specifics of your comments...

*...twapi related example..* - no, your code fragment, does not solve the
problem. When the lambda fires, the connection is not complete. The fconfigure
and fileevent following are completely irrelevant. See above. Even otherwise,
the idiom of of socket -async followed by fconfigure -blocking 0 has been around
forever. There has been nothing to forbid it. There is no justification for
asking that to be reworked as a workaround for a buggy core change.

*...Here's another example...* about puts after a socket -async. Again, 
irrelevant. puts is not the issue. See above.

*...socket-14.11.0 failures...* Oh if only we could fix all test failures by
disparaging the test! This is a perfectly reasonable test reflecting a real
world use case. I also do not understand your purpose behind quoting the manpage
flushing /closing the channel. That is completely irrelevant to this issue.

*...I believe this existing fix is theoretically correct...* Outside of some
special cases, a timer based event generation scheme for I/O is completely
pointless. It defeats the entire purpose of event based I/O. Imagine the Unix
select call, which serves a similar purpose, marking descriptors as "ready"
based on a timer as opposed to actually being ready.

**Timer based event generation for i/o is essentially polling at regular
intervals. There is no point to such an "event driven" i/o system where the
"event" is a timer as opposed to i/o! An application might as well just poll
itself using after!**

*A few years ago I submitted issue 67a5eabbd3 and fixed it* Relevance? No one is
disputing refchan limitations but my -buffering none branch does not break that
issue either. I could not find a relevant test case but I manually tested the
script in that issue.

*I'm not in favour of changing the default buffering for refchans.* It would
help if you pointed out the technical issues or negative practical impact as to
why. I've already stated in my earlier post that fixing refchans would be
preferable but I do not think that is feasible in a reasonable time frame. An
opinion that "it is a knee jerk reaction" is neither here or there. Your
reference to the 080f846fd58958ef is a red herring because you did not address
the actual issue being raised there.

pooryorick added on 2024-03-31 01:12:56: (text/x-fossil-wiki)
Yes, I incorrectly claimed that [chan configure ... -blocking] was
redundant with [socket -async]. I forgot that [socket -async] returned a
blocking channel.   However, the fact that [socket -async] returns a
blocking channel provides the solution to [080f846fd58958ef]:  Don't
immediately configure the channel to be nonblocking, and instead do
this:

<code><verbatim>
	set so [twapi::tls_socket -async www.example.com 443]
	fileevent $so writable [list apply {chan {
		#chan is now connected
		fconfigure $so -blocking 0 
		fileevent $so writable [list apply {chan {
			#do stuff with connected channel
		} $so]
	}} $so]
	vwait ...
</verbatim></code>

Problem solved:  The event will be triggered only when the channel is
connected.  Now, on to the issue of "writable" events on non-blocking
channels.  If a channel exists and is nonblocking, it's writable.  Any
other state of the channel, whether it is error state or connection
state is not a factor.  That's how nonblocking channels have always
functioned in Tcl, and if you think about it, that's the way it must be:
If something is blocking the writability of a channel, then it's
no-longer nonblocking.  That's why it has always been the case that for
a nonblocking channel [puts] never fails, and also why "Tcl continues to
buffer the data and writes it in the background as fast as the
underlying file or device can accept it." ([puts] documentation), and
why [close] is the operation that reports the error, if there was one:
There's no other strategy that is practical.

Here's another example:

<code><verbatim>
	set chan [socket -async <some address>]
	chan configure $chan -blocking 0
	puts $chan hello
</verbatim></code>

Here, it's entirely possible that the socket will not be connected yet,
but when [puts] executes it will never fail, nor should it.  That's the
point of configuring a channel to be nonblocking:  It doesn't ever
block.  By definition, if a channel never blocks, it is always writable.
Over in [080f846fd58958ef] Ashok wrote, "Under no circumstances should a
file event writable callback be invoked unless the channel is actually
open (writable) and the above does not make that check."  As we see in
the example above, that's wrong:  [puts] doesn't fail, and if it doesn't
fail, then the channel was writable.  If the channel is writable,
there's no problem triggering a corresponding event on the the channel.
Not only that, but the fire-and-forget aspect of nonblocking channels is
very convenient, particularly at the script level.

I incorrectly claimed that the test 14.11.0 failure we were talking
about was the case where it produces an error.  The actual situation is
that the test fails if an error does not occur, but I believe it is by
design that sometimes there is no error.  The documentation for [close]
states,

  "If the channel is nonblocking and there is unflushed output, the
  channel remains open and the command returns immediately; output will
  be flushed in the background and the channel will be closed when all
  the flushing is complete."

That means that for systems that take their sweet time figuring that a
connection can't be completed, [close] will in the meantime return
without error, which in turn means that test 14.11.0 is only testing
circumstance rather than anything meaningful.  Perhaps that test should
be deleted, or perhaps a pause should be introduced to let the system
finish meditating on the state of the connection.

I believe that the existing fix for this issue is theoretically correct,
and should remain in place.  A few years ago I submitted issue
[67a5eabbd3] and fixed it.  Prior to this, refchans were thoroughly
broken in an subtle way, and I believe this caused refchan programmers
over the years to throw up their hands and just walk away from refchans,
or to invent twisted cargo-cult approaches to keeping their refchans
responsive.  That issue is related to this one.  Not taking the approach
that has now been established for triggering writable events on
asynchronous channels risks making them unusable again in ways that are
too subtle for most programmers to ascertain, and in my experience means
that we'll just keep playing whack-a-mole with issues related to
handlers not ever getting triggered even though the should be.
 
I'm not in favour of changing the default buffering for refchans.  That
would be a knee-jerk reaction to a particular problem rather than a
well-considered design, and in issue [080f846fd58958ef] there is so far
no actual example of a broken script that would warrant such a change.

jan.nijtmans added on 2024-03-30 16:52:11: (text/x-fossil-wiki)
> Jan, I do not think you have fully understood the issue

Indeed, I don't understand 100% of all details, but globally, yes I do.

Switching buffering off for refchan's by default sounds OK to me: Buffering is meant to assure that the channel can continue operating if the next step is blocked. So, in a chain, if one of the elements has buffering on, all the other channels are able to operate through that.

Also, delaying the timer even until after the channel is fully open sounds OK to me. But it would be good to assure it doesn't break any outside channel implementation. Better be a little bit too careful here.....

Anyway: discussion about this is a good thing happening! You are bringing in a lot of arguments and examples! That's constructive, and - hopefully - helps finding a solution.

apnadkarni added on 2024-03-30 15:59:38:
With respect to side effects with my branch setting refchan buffering to none by default, I have not seen any after experimenting for a while. All tests pass on Win and Linux and for the most obvious consequence - performance - there is no discernible difference:

With current trunk:

```
% timerate -calibrate {}
0.08983439299958879 µs/#-overhead 0.090688 µs/# 21524429 # 11026859 #/sec
% set fd [tcl::chan::memchan]
rc0
% timerate {puts -nonewline $fd x}
7.680620 µs/# 128693 # 130197 #/sec 988.442 net-ms
% seek $fd 0
% string length [read $fd]
128693
```

With the -buffering none based solution

```
% timerate -calibrate {}
0.09071036436181426 µs/#-overhead 0.091423 µs/# 21351367 # 10938200 #/sec
% set fd [tcl::chan::memchan]
rc0
% timerate {puts -nonewline $fd x}
7.687302 µs/# 128568 # 130084 #/sec 988.341 net-ms
% string length [read $fd]
0
% seek $fd 0
% string length [read $fd]
128568
```

I suppose one might expect this. Buffering is really meant to mitigate the cost of disk i/o and system call transitions. There is none with reflected channels.

An alternative, but similar effect, would be to set buffersize to 1 but the related code is a little more convoluted and I don't think offers any benefit over -buffering none.

I'm inclined to favor the -buffering none solution. It handles not the socket test and Nathan's coro test. Docs will have to note refchans do not get buffering by default. Also, we would have to decided whether the application can change the buffering mode (I would say yes as long as the potential issues are noted in the documentation. If that is not desired, we could simply force the buffering to be -none and ignore / raise error on any requests for change).

apnadkarni added on 2024-03-30 15:38:55: (text/x-markdown)
While we are on this code, another doubt regarding the surrounding code in question comes to mind. In 8.6, the code fragment that checks for **read** event generation was always executed (assuming the appropriate conditions were met). In 9.0, if the new write event checks (that we are discussing) match, the read side is not checked at all. Is this intended? Not directly relevant to current discussion but I'm curious to know the reasoning.

Code in question is
https://core.tcl-lang.org/tcl/file?ci=8a8722e215ef8d63&name=generic/tclIO.c&ln=8761-8793

apnadkarni added on 2024-03-30 12:02:18: (text/x-markdown)
A channel writable event is being generated by a piece of code on a periodic basis with no knowledge of the channel state and whether it is actually writable or not. This is so broken as to be self-evident in my opinion and should need no debate but to answer Nathan's points below.

The failure of socket-14.11.0, marking a connection when it is not yet open, or  is not a trivial matter. That specific scenario is exactly one that might be used to implement a port scanner (which always operate asynchronously for obvious reasons). The current trunk implementation will show all ports open even when they are not! And note the script is not aware the channel is broken as you seem to think!

Consider the complementary case where they are open. The event fires BEFORE the TCP connection actually completes, the write handler runs and writes connection details to syslog. Not an uncommon scenario but the fconfigure that retrieves address/port information returns empty strings because the connection has not completed making the logging useless.

A third example - admittedly a code inspection based conjecture; I have not tried this as I don't have a multihomed machine. Tcl's socket implementation is smart enough that if the destination hosts resolves to multiple addresses, it will try each in turn until one succeeds. The write event will trigger on the first *attempted* connection which may actually fail when a subsequent candidate succeeds.

Now with respect to the `-async` option, let's ignore twapi because it simply clones Tcl socket semantics. The two illustrative fragments you gave are very different semantics. When the -async option is specified, the command returns without waiting for the TCP connect to be completed and connection open. The application can continue other work until the connection is open. Without the -async option, the application is blocked from any other work until the connection is open, which may be several or even tens of milliseconds. Changing socket with fconfigure -blocking 0 cannot change this (in the second example!). -async does not make the socket non-blocking, and -blocking 0 does not make the connection asynchronous (how can it when it can only be done afterward!). So I completely fail to understand your comment of redundancy.

The fundamental problem here is that the reflected channel framework has incomplete support when it comes to event driven I/O. All channel drivers I have written or seen have two components. The first is the channel driver itself - in this case the tclrchannel driver. The second component of the driver hooks into Tcl event processing via the Tcl_EventSetupProc / Tcl_EventCheckProc callbacks. These are where the channel implementations generate notification events but the refchan framework has equivalent. The correct fix would be to add the equivalent facilities to refchan, or alternatively enhance the channel driver API (and refchan) to be able to exchange state information that would let the common channel code make proper decisions that depend on channel state.

Both are likely to be non-trivial and tricky to get right, possibly why they were never implemented, and probably not feasible in the 9.0 time frame.

As an alternative solution, I propose turning off buffering for refchans (either via an internal -buffering none or a -buffersize 1). This forces calls to refchans always to be passed down to the refchan script which is then responsible for generating events exactly as Nathan's test code does. This experiment is in branch [apn-bug-de232b49f2](https://core.tcl-lang.org/tcl/timeline?r=apn-bug-de232b49f2). There are some side effects that I will detail later but for now, all tests pass (both socket-14.11.0 as well as io-44.6 which tests Nathan's coroutine based driver).

pooryorick added on 2024-03-29 21:51:53: (text/x-fossil-wiki)
The refchan issue described here is far more serious than the failure of
test socket-14.11.0, because in a situation like the one illustrated by
that test, at least the script becomes aware that the channel is broken,
and has the opportunity to respond somehow.  In contrast, any script
that falls victim to the refchan issue described in this ticket, it has
no recourse, and simply fails silently and irretrievably.

There are many ways in which channel can become broken, and a socket
failure to connect is for practical purposes not much different than any
of those other failure modes.  In my opinion, "async" should mean async
from start to finish, and the script itself can decide how to handle the
various possible failures in an asynchronous way. 

In the meantime there are ways to avoid the problem Ashok illustrated
with twapi in [080f846fd5].  For one thing, the example is a little
strange:

  <code><verbatim>
      set so [twapi::tls_socket -async www.example.com 443]
      fconfigure $so -blocking 0
  </verbatim></code>

The socket is opened in asynchronous mode, but the for some reason it is
still necessary to set the blocking mode of the resulting channel to 0.
Why the redundancy?  I would guess that if the "-async" argument were
removed, then twapi would wait for the connection to complete, and then
the rest of the code would continue to function asynchronously:

  <code><verbatim>
      set so [twapi::tls_socket www.example.com 443]
      fconfigure $so -blocking 0
  </verbatim></code>

The same is true for socket-14.11.0.  The documentation for the "-async"
option to [socket] has long described the possibility:  "This means that
the socket will be created immediately but may not yet be connected to
the server, when the call to socket returns."

So even just following the documentation will avoid the issue Ashok has
raised.  I would not revert this fix.  There may well be a more
satisfying fix, and if it is found, it can of course replace this one.

jan.nijtmans added on 2024-03-29 20:46:23: (text/x-fossil-wiki)
@nathan, can you repond? Would it be best to revert this 'fix', and find a better one?

apnadkarni added on 2024-03-29 05:40:37: (text/x-markdown)
Marking this as not fixed as I believe the "cure" is worse than the disease. The change results in bug [080f846fd5](https://core.tcl-lang.org/tcl/tktview/080f846fd5). See that ticket for details but in a nutshell generating "writable" events without ascertaining that the channel is in fact open is fundamentally wrong.

Chasing that bug also led me to (finally) chase down the cause of the socket-14.11.0 test failure (that Don reported subsequent to this commit which is completely reproducible on Windows and something that I've been meaning to track down for a while) which led back to exactly this again. The write handler is invoked before the channel is actually open. This should never be the case.

Some other solution to the original failure in this ticket is needed other than generating writable events on a timer. At the very least some additional state checks that the channel is open is needed.

dgp added on 2021-05-27 15:19:23:
I still frequently (though not always) see test socket-14.11.0 fail.

pooryorick added on 2021-05-27 06:29:04:
The issues mentioned may have been fixed by subsequet work.  Are there any outstanding issues related to this ticket, or can it be closed now?

dgp added on 2019-05-06 13:50:46:
Failures of chan-io-44.6 may be arising here too.

dgp added on 2019-05-06 13:17:19:
The test socket-14.11.0 started failing with this change. Should we resolve
here or I open a new ticket?

pooryorick added on 2019-05-01 11:00:15: (text/x-fossil-wiki)
Further explanation:  The documentation for [chan event] states that a channel is considered to be writable if at least one byte of data can be written to the underlying file or device without blocking, or if an error condition is present on the underlying file or device.  However, for better performance, Tcl produces writable events until the internal buffer is flushed.  This fix extends this behaviour to reference channels.

pooryorick added on 2019-04-27 07:24:06: (text/x-fossil-wiki)
Fixed in [d0dd6d19a41ae7af].

Attachments: