Tcl Source Code: Rationale for rollback of refchan event generation in core

All,

Below is my rationale for requesting a rollback of certain post-8.6 changes related to event generation for refchans. Thanks to Jan, the experimental branch bug-18f4a94d03 contains the rollback.

In summary, the changes

were unnecessary as they were made in response to illustrative scripts that were incorrect
even if necessary, they changed the behavior of public commands and should therefore have been TIP'ed
completely nullify the purpose of event driven i/o

The below merely consolidates the discussion in various related tickets. In (roughly) chronological order:

Commit e066e24f

The commit changed the chan postevent from synchronous to asynchronous operation as a fix for ticket 67a5eabb.

TL;DR This change was unnecessary and if at all shown to be so, should have absolutely required a TIP.

For starters, the test case motivating the fix is itself flawed. The coroutine busy error was caused by the refchan implementation directly invoking chan postevent which results in a nested coroutine invocation. For these use cases, in particular where the refchan implementation is non-reentrant, it has always been the case that refchan implementations post events using the after idiom as described in Andreas' 2009 Tcl conference paper. As @aspect noted in the same ticket this is also true of the tcllib refchan implementations. It also is true of TWAPI's TLS implementation.

This change resulted in @bll logging ticket 18f4a9 illustrating a failure in previously working code. @pooryorick marked this as invalid because event loop was not running (as per my understanding). Without commenting on the merits either way, it is still the case that working refchan implementations that correctly used the after idiom will now have events passed through the event queue additional times for no good reason.

In any case, a change in a command behavior from synchronous or asynchronous should require a TIP. Paraphrasing @dgp's post, this results in (a) loss of the ability to choose synchronous and asynchronous operation, and (b) an incompatible loss of function in a public command.

Commit d0dd6d19

As a fix for de232b49, buffered coroutine write events not called, the commit introduced the continuous generation (roughly equivalent of after 0 but at the C level) of write events (when a write event handler is registered on a non-blocking channel) within the channel core implementation.

TL;DR Again, as in the previous commit, the change was unnecessary. It also resulted in new bugs.

As stated above, refchans are responsible of generation of their own events and the test example in the ticket is flawed in its implementation. It should have set up a timer itself at the script level as demonstrated in one of the attachments to that ticket. Instead, the commit changed the Tcl core itself to generate write events in a continuous 0-interval timer with no knowledge of the actual channel state. As a result,

Test socket-14.11.0 started failing with this commit, sporadically on Unix as reported by @dgp, and consistently on Windows. Root cause was the fileevent handler being called before the connection was actually open.
Ticket 080f846f, test failures in TWAPI TLS. Same root cause as above.
The Tk UI freeze subsequently logged at ac7592e7 resulting from the continuous placement of writable events on the event queue.

Attempts were made to fix the above as summarized below but the bigger issue is that this is fundamentally broken design for more than one reason.

The events are generated with no knowledge of the device (refchan implementation) state. Thus the writable event that a connection was open (bugs 1, 2 above) would fire without checking the connection was open.
The documentation for fileevent states A channel is considered to be writable if at least one byte of data can be written to the underlying file or device without blocking.... Notifying a channel is writable purely based on a timer is in violation of that statement. As an aside, in the documentation of chan event, the wording underlying file or device was removed in commit eb627bda. This might have been just a side effect of an effort towards readability but it has semantic consequences as without those semantics writable events are pretty much useless. In any case, I restored the original wording as part of the chan manpage consolidation.
Generation of events on a continuous basis at regular intervals is no different than polling. As elaborated in this post it completely defeats the purpose of an event driven I/O architecture. If this was an acceptable solution, the entire event architecture for I/O in Tcl can be jettisoned. Just have a timer that generate events for all registered channels. No need to even check the channel state!
Generation of events with a regular interval of 0 is even worse. This is essentially continuous polling with all that implies! Bug 3 above (frozen UI) is just one consequence. This is really such a fundamental design flaw, no other justification for reverting should be needed.

@pooryorick has proposed fixes for the specific bugs listed above.

70a407c2 is proposed to deal with the premature connection open notification bug. While it may fix this specific issue, it does so by essentially guessing connection state based on buffer status. In doing so, it introduces dependencies between two unrelated and conceptually independent modules - buffer management and state-based event generation. Any future changes to the way the channel buffers are managed has the potential to break the latter. This is fragile and in SE terms, coupling between unrelated components is an anti-pattern for maintainability.
b2d083e9 is proposed as as a fix for the idle event queue starvation by moving the events from the event queue to the idle queue. My objections are noted at 934bb3a6 as has @sebres in the comments to ac7592e7.

In any case, aside from these misgivings about these fixes, more concerning is that these are point fixes and neither does anything to correct the fundamental point that continuous timer based generation of I/O events completely defeats the intent of event based I/O.

Rebuttals

@pooryorick has objected to reverting. See his comments and my responses in the various tickets linked above.

One additional point I want to make about fundamental motivation for the commits. Either

At the time of logging the "bugs" that motivated the changes, the author of the scripts was not aware of the refchans being responsible for generation of events themselves. The purported bugs in the core were really bugs in the script and did not require these core changes. The scripts should have been fixed.

The intent was as @pooryorick stated at a later time "I think it's pretty clear that refchan implementations shouldn't be burdened with the duty of maintaining timers". No, it's not clear at all. Since it's only the refchan implementation, and not the channel core, that knows the full channel state, it is the only component that can correctly generate events based on state of the "underlying file or device". The channel core cannot do this correctly until TIP 131 is implemented. And as far as burden goes, to put it bluntly, folks not capable of scheduling after timers have no business writing refchans. Ironically, @pooryorick in his comments suggests how previously working code that was failing due to the commits in question should be modified to work around the issues introduced, in effect "burdening" users of the channel with the duty of working around the introduced flaws!

Andreas Kupries' comments on the core list in support of reverting

[This response collects quotes from multiple mails received from Nathan]

> Hi everyone,
 
> A number of people, including apparently several members of the Tcl
> Core Team, have decided that [chan postevent] is part of Tcl's
> synchronous API for channels rather part of Tcl's asynchronous API
> for channels, and on that basis want to close the following issue
> reports
 
> 	https://core.tcl-lang.org/tcl/tktview/67a5eabbd3d195915d3ccc13246bdecf10e20726
> 	https://core.tcl-lang.org/tcl/tktview/de232b49f26da1c18e07513d4c7caa203cd27910
 
> The idea that [chan postevent] is part of Tcl's synchronous API is
> one of the most wrong-headed ideas I've encountered among the
> contributors to Tcl.

Strong claim. Emotional argument also, not technical.

> I'm sure that at least some people reading this have experienced
> failure when trying to implement a refchan, and as a result have
> simply turned their backs on refchans and found some other way to
> get the job done.  When [chan postevent] is part of Tcl's
> asynchronous API,

> as was always intended,

You seem to claim telepathy as well. Specifically that of my mind,
given that I was the writer of [TIP #219][1] back in 2004.

[1] https://core.tcl-lang.org/tips/doc/trunk/tip/219.md

Despite the distance of 20 years I am pretty sure that I had not
intended async operation.

And coming back into things after reading more mails I am now fully
sure it was not intended to be async. See later where I talk about
what a `refchan` truly is. A channel driver, just in Tcl. Operating
under the same rules as a C-level driver. Whose equivalent to
`postevent` is, yes, verily, synchronous.

> you wouldn't have experienced that "hanging" behaviour of refchans.
 
> For all those who have been bitten by refchans, now would be a good
> time to voice your support for fixing bugs keep it from being
> asynchronous as it should be.
 
> Here is a list of all wiki pages containing examples of postevent
> that are broken until the issues referenced above are fixed:
 
> 	https://wiki.tcl-lang.org/page/fifo2+in+pure+Tcl
> 	https://wiki.tcl-lang.org/page/reflected+channel+example
> 	https://wiki.tcl-lang.org/page/reflected+channel
 
> Here is a list of wiki pages that contain ugly hacks to work around
> the issues reference above:
 
> 	https://wiki.tcl-lang.org/page/A+template+reflected+channel
> 	https://wiki.tcl-lang.org/page/topcua

The hack you refer too are the `after ...` constructions used to push
`postevent` calls out of non-reentrant parts of the procs, right ?

I do not see such as hacks, but rather as the standard way to avoid
reentrancy troubles anywhere.
 
> The general principle here is that event handlers should not
> recursively enter event-handling, but should instead allow the event
> loop to mediate.  This should be obvious, so I don't know why we now
> have to debate it for [chan postevent].

> Was it ever documented to be synchronous?

No. Of course not. Because synchronous behaviour is the default for
pretty much all Tcl commands, and it is the __exceptions__ from that
default which are documented.

> Did any change I made require a change in the documentation?

Changing a command from sync to async is a strong change of the
semantics. Yes, that requires a change of the docs. Because, see
above, the async behaviour of a command is the exception from the
default.

>If not, no TIP is required to fix bugs in the implementation.

Just declaring something as a bug does not give you a
free-to-change-this card.

> Yes, really.  [chan postevent] is the mechanism provided for a refchan
> to notifiy Tcl of events on the channel.  At the C level, the mechanism
> provided to an event source for this purpose is Tcl_QueueEvent(), which

Actually the C level equivalent of `postevent` is `Tcl_NotifyChannel()`
which is very much synchronous. Somehow the C-level channel drivers
still manage to work.

https://www.tcl-lang.org/man/tcl8.6/TclLib/CrtChannel.htm

	Tcl_NotifyChannel is called by a channel driver to indicate to
        the generic layer that the events specified by mask have
        occurred on the channel. Channel drivers are responsible for
        invoking this function ___whenever the channel handlers need to
        be called for the channel___ (or other pending tasks like a write
        flush should be performed). See WATCHPROC below for more
        details.

(WATCHPROC) https://www.tcl-lang.org/man/tcl8.6/TclLib/CrtChannel.htm#M16

	... the channel driver is responsible for calling
	Tcl_NotifyChannel to inform the generic channel module. The
	driver should take care not to starve other channel drivers or
	sources of callbacks by invoking Tcl_NotifyChannel too
	frequently. Fairness can be insured by using the Tcl event
	queue to allow the channel event to be scheduled in sequence
	with other events

So, `Tcl_NC` is synchronous in C, and C-based channel driver are
responsible for using the C-level equivalent of `after` to explicitly
go through the event queue where/when needed, i.e. to avoid reentrancy
issues, etc. And these drivers __do so__.

A `refchan` is a __direct__ mapping of a channel driver into Tcl, and
therefore has the __same responsibilities__ as a C-level driver, up to
and including to explicitly go through the event queue when needed,
using `after` as the effective entrypoint to `Tcl_QueueEvent`.

A refchan directly talks to the low-level Tcl's IO system like C
drivers to, and has to follow the same rules as they do.


> My answer is that [chan postevent] was always sufficiently
> documented as asynchronous,

I see no such documentation in the `chan` manpage, nor in the TIP 219.
Please provide a reference to the docs you believe made the async
claim.

Poor Yorick's response

whether [chan postevent] was intended to be asynchronous, you had to
look decades back into the past and reconstruct your memories in light
of the current debate, and you have persuaded yourself that [chan
postevent] was intended to be synchronous.  I'll now attempt to persuade
you of the opposite.

In the small, Tcl_NotifyChannel() appears to be synchronous, but when
one steps back and looks at the larger picture, it is part of an
asynchronous operation.  It's true that Tcl_NotifyChannel() does not
queue up an event, but that's because Tcl_NotifyChannel() is itself an
event handler.  In tclUnixChan.c, which is the implementation of the
*nix channel driver for files, FileWatchProc() passes
Tcl_NotifyChannel() to Tcl_CreateFileHandler(), and the notifier then
calls Tcl_NotifyChannel() later when it detects an event on the channel.
This effectively makes Tcl_NotifyChannel() asynchronous, as its only
caller is the event loop itself.  When this also true for refchans, i.e.
when the only caller of a handler for an event on the refchan is the
event loop itself, then the operation of refchan is consistent with the
operation of other channels.  Unfortunately, due to the reversions
listed below, this is no-longer the case.

You named it [chan postevent] instead of [chan notify] because you
intuited that it was not at the same level as Tcl_NotifyChannel(), even
if you didn't articulate then, or even now.  [chan postevent] is not
Tcl_NotifyChannel().  It operates at an even lower layer:  It is the
mechanism that informs the notifier that there is an event on the
channel.  File event detection is not part of the channel driver, but
rather is a part of the notifier:  select() is called in tclUnixNotfy.c,
not in tclUnixChan.c.  In contrast with other event sources that are
surfaced through select(), which interacts with external systems during
normal event loop operations, a refchan must have some way to make the
event known to Tcl.  [chan postevent] was provided for that purpose.  In
other words, [chan postevent] is the analogue of select(), not the
analogue of Tcl_NotifyChannel().  As such, it should return something to
the notifier, like select() does during normal event loop processing.
The way to have a refchan do that is for [chan postevent] to place an
event on the queue, just as its name implies.  So you got the name
right, even if you weren't sure exactly why.

Requiring every refchan to use [after] to call [chan postevent] is a
completely unnecessary wart.  It's nothing more than leakage of poor
implementation into the script level.  Inspection of the design proves
that it is unnecessary, and not one example has been brought forward
demonstrating any reason a refchan might need to call [chan postevent]
synchronously when a channel is nonblocking.  No one out there is doing
it either, because it would result in sporadic deadlocks.  This is
simply hand-wringing over a fictitious situation.  Since a refchan must
use [after] to call [chan postevent] if it wants to work properly,
taking care of that part at the implementation level is a much better
strategy.

For the last five years, the following bugs have been fixed on
core-8-branch and in trunk

    https://core.tcl-lang.org/tcl/tktview/de232b49f26da1c18e07513d4c7caa203cd27910

    https://core.tcl-lang.org/tcl/tktview/67a5eabbd3d195915d3ccc13246bdecf10e20726

Throughout five years of development on those branches, the fix has
never been called into question, and has never lead to another valid bug
report (at least not one that wasn't quickly fixed).  Before coroutines,
calling [chan postevent] without a timer "only" lead to deadlocks.
Nowadays, when a refchan is implemented as a coroutine, it is an error
to call [chan postevent] synchronously.  I'll repeat that:

    It is an error to call [chan postevent] synchronously.

Coroutine implementations of refchans make it manifestly clear that
[chan postevent] must be asynchronous.  Therefore, there is no point in
trying to pretend that [chan postevent] is synchronous.  It must go
through the event queue, and the only way currently to do that is to use
[after].  Although it is coroutine implementations of refchans that make
this obvious, it is and always has been true for all other
implementations as well.  It's just that with other implementations, the
error can be more camouflaged, with deadlocks happening at unpredictable
times.

But now, via the following commits, the fix for the bugs mentioned
above has been undone.

    core-8-branch

        https://core.tcl-lang.org/tcl/info/8d4d978bd3ca580d

    trunk

        https://core.tcl-lang.org/tcl/info/1ec9927351b255fb

This work was tedious and painstaking, it solved the bugs referenced
above, and it made the operation of refchans more consistent with other
types of channels. Now Tcl has lost this contribution, and all for a
"bug" report that doesn't actually demonstrate any reproducible bug, and
only because someone has some ideas but no code to stand those ideas on.
It's a mistake to undo this work, and it's a mistake to rationalize
undoing it by documenting [chan postevent] as synchronous.  Let's bring
back consistency with the design of the IO subsystem, ditch this
brain-dead idea that [chan postevent] should do anything other than post
a notification for the notifier, and revert the reversions of the code
that fixed the bugs above.

-- 
Yorick