Tcl Source Code

View Ticket
Login
Ticket UUID: a58737f57bf740979e0ed091118508188b8c28f5
Title: The macOS implementation of Tcl_WaitForEvent is incorrect.
Type: Bug Version: 8.6.10
Submitter: marc_culler Created on: 2020-07-10 22:08:56
Subsystem: 02. Event Loops Assigned To: marc_culler
Priority: 5 Medium Severity: Important
Status: Closed Last Modified: 2020-07-18 16:58:34
Resolution: Fixed Closed By: marc_culler
    Closed on: 2020-07-18 16:58:34
Description:
The platform specific function Tcl_WaitForEvent (in tclMacOSXNotify.c for
macosx) is called from Tcl_DoOneEvent, the core of the Tcl event loop.  It
gets called in Tcl_DoOneEvent after first checking that there are no
asynchronous events and no window events queued. Before calling Tcl_WaitForEvent
the setup proc for each available event source is called. Then the check proc
is called for each event source.  The main job of the setup proc is to set the
timeout value which will be passed to Tcl_WaitForEvent.  On macOS if there
are any events in the Apple window event queue then the timeout is set to 0,
meaning that Tcl_WaitForEvent should check for file events and return
immediately with a positive return value if any file events have been
recorded by an auxiliary notification thread which runs select in the
background and records any active files descriptors in a thread specific
data structure.  If the Apple window event queue is empty then the timeout
will be non-zero and Tcl_WaitForEvent will wait until the timeout expires.

In actual fact, however, the original Cocoa implementation of Tcl_WaitForEvent
did not return immediately when the timeout was set to 0.  Instead, before
polling for file events, the following loop was run:

 while (Tcl_ServiceAll() && tsdPtr->waitTime == 0) {}

This code predates the release of Cocoa and was probably following the
instructions in the Tcl_ServiceAll man page regarding how to embed Tcl in
an application which has its own private event loop. There are two issues
here.  First, the man page does not recommend calling Tcl_ServiceAll in a loop.
Doing that would appear to run the risk of a hang, in cases where servicing
one event causes a new event to be queued. Indeed, there were many reports
of macOS hanging and Kevin Walzer found that calling Tcl_ServiceAll once,
not in a loop, reduced the incidence of hangs.  So, in commit [c699f98468]
in 2015, Kevin removed the loop and replaced it by a single call with
noticeable improvement in the behavior.

The second issue is more subtle.  While it was probably true in the Carbon
implementation of Tcl that Tcl actually was being embedded in an application
with a private event loop, I think that ceased to be the case with the release
of Cocoa.  It is true that the NSApplication object has a private event loop
which can be started by calling the function run().  But Tcl never calls that
function. The run() method handles dispatching of Apple NSEvents to AppKit
objects such as NSView and NSWindow.  But that was not helpful for Tk since
what was needed was to translate NSEvents into XEvents and add them to the
Tcl queue so Tk could process them.  Instead of running Apple's private event
loop, the Cocoa implementation of Tk emulates the Apple run() loop within the
check proc for the Aqua event source.  The check proc removes NSEvents from
the Aqua queue manually and translates them into XEvents before calling
NSApplication sendEvent to allow the NSEvents to be processed by the window
manager.  The adaptation of TclWaitForEvent to account for this change was
never done.

While Kevin's change in 2015 reduced the number of hangs observed in Tk
considerably, they still persisted.  Moreover, the design of Tcl_WaitForEvent
caused a lot of randomness in when the [NSView drawRect] method was being
called.  After Christopher Chavez noticed that the drawRect method only gets
run during calls to CFRunLoopinMode or NSNextEventMatchingMask (used in the
setup proc and the check proc) I was eventually led to study what was actually
happening in Tcl_WaitForEvent.  (With the release of OSX 10.14 (Mojave) it became
critical to control when drawRect gets called because Apple made changes for
that release which caused it to be the case that graphics calls which were not
made from within drawRect were ignored; those calls would have no effect on the
appearance of any application windows, causing Tk to do nothing visible other
than display empty black windows.) 

My conclusion was that TclServiceAll should never be called in TclWaitForEvent.
This is consistent with the fact that Tcl is really not being embedded in an
application with its own private event loop.

I have made this change to tclMacOSXNotify.c in the branch named macOS_hangs.
I have tested extensively both with regression tests and real applications.

There are no noticeable bad effects in Tcl.  All regression tests pass.  The
branch also makes some unrelated changes having to do with the fact that the
NSSpinLock was deprecated by Apple with the release of OSX 10.15 (Catalina) and
replaced by the os_unfair_lock.  The additional changes allow the use of
os_unfair_lock in place of NSSpinLock on systems which support this more efficient
new lock type.

This change to Tcl, together with changes to how [NSView drawRect] is handled
in Tk, led to significant improvements in Tk.  Some regression test failures
disappeared including those reported in [06d8246baf] and [37a27ddb2] and
[aedc9f6fa4] as did many graphical artifacts which were reported for example
in tickets [2a61eca3a].  These changes are in the Tk branch idle_curiosity.

I have extensively tested the combination of idle_curiosity and macOS_hangs
for several months as has Nicolas Bats and, I believe, others.  I would like
to see both of these merged into the core_8_6_branches before the release of
8.6.11.  I would be very grateful for reviews, comments and testing.
User Comments: marc_culler (claiming to be Marc Culler) added on 2020-07-18 16:58:34:
The macOS_hangs branch has been merged into core-8-6-branch and core-8-branch.

I am closing the ticket.

marc_culler (claiming to be Marc Culler) added on 2020-07-17 22:26:38:
The drawing artifacts turned out to be related to withdrawing windows before
configuring them and then deiconifying them later.  The problem appears to
have been fixed in Tk by making some small changes to WmWithdrawCmd and
TkpWmSetState.

chrstphrchvz added on 2020-07-17 03:12:42:

Cc'ing myself for updates. Also fixed version field (10.6.10 → 8.6.10).


kevin_walzer added on 2020-07-17 02:19:21:
Marc, I built the new tip of idle_curiosity and don't see the flicker with the widget demo anymore. However, my Ruby app is still affected. I certainly don't expect you to debug a Ruby app, but the attached screen shot shows what's going on.

marc_culler (claiming to be Marc Culler) added on 2020-07-16 14:39:57:
Thank you, Kevin.

This glitch was caused by the latest commit to the Tk idle_curiosity branch.

I believe it is now working again.  Please retest with the current tip of
idle_curiosity.  I do not see any problems with the animated labels when
I use that tip.

Sorry!

kevin_walzer added on 2020-07-16 13:29:31:
With the combination of these two branches, I see random flashes and redraw issues in the "Animated Label Demonstration" in the widget demo. Just launch and let it run, the animated labels will briefly flash black before redrawing. I'm not sure there's a pattern, nor can it be reproduced except by letting it run and watch for a bit. I also have a Ruby-Tk application that consistently shows redraw issues upon startup that were not there before installing these branches - a large portion of the window is black and will not draw until you mouse over that section of the window. Similarly, I don't have a sample script to reproduce, but can make a download available for testing.

kevin_walzer added on 2020-07-14 02:12:32:
After pulling in the commits from today in idle_curiosity, Wish no longer locks with this script. 

I'll keep poking around just to see if anything crops up.

marc_culler (claiming to be Marc Culler) added on 2020-07-13 13:30:31:
Thank you, Kevin.

I am not able to reproduce this hang.  I start wish.  I source the file that
you pasted in.  I press Run.  The left counter counts up.  I press Pause.  The
right counter says 2.  I am able to enter commands at the terminal and they
are executed.  I do not see a beachball.

I am using the current tips of idle_curiosity and macOS_hangs on Catalina
10.15.5.

????

kevin_walzer added on 2020-07-13 12:24:12:

Can't figure out how to attach a file - any suggestions? Anyway, here is the demo code.


variable main_loop_stack {}
variable stack_height 0
proc push_main_task {args} {
   variable main_loop_stack
   lappend main_loop_stack $args
}
proc completion_main_loop {} {
   variable main_loop_stack
   variable stack_height [llength $main_loop_stack]
   if {[llength $main_loop_stack]} then {
      set cmd [lindex $main_loop_stack end]
      set main_loop_stack [lreplace $main_loop_stack end end]
   } else {
      set cmd increment_count
   }
   eval $cmd
   after 0 [list after idle completion_main_loop]
}
variable main_count 0
proc increment_count {} {
   variable main_count
   incr main_count
   push_main_task count_down $main_count
}
proc count_down {n} {
   after 5
   if {$n} then {push_main_task count_down [expr {$n-1}]}
}
interp alias {} pause_main_loop {} push_main_task return
proc halt_main_loop {} {
   variable main_loop_stack
   set main_loop_stack [linsert $main_loop_stack 0 return]
}
eval {
   toplevel .control
   grid [
      label .control.lcount -textvariable main_count
   ] x [
      label .control.lheight -textvariable stack_height
   ]
   grid [
      button .control.halt -text "Halt" -command halt_main_loop
   ] [
      button .control.pause -text "Pause" -command pause_main_loop
   ] [
      button .control.run -text "Run" -command completion_main_loop
   ]
   wm title .control "Control panel"
}




kevin_walzer added on 2020-07-13 12:17:21:
Marc, to test this, I went back to the original bug report:

https://sourceforge.net/p/tktoolkit/bugs/2823/

and tested the "beachball-bug.tcl" script to document the bug. I'll upload this as an attachment. 

The bug report outlines the issue and illustrates it nicely - Wish locks up. With the requested branches being tested, Wish locks up again. Wish did NOT lock up after my various commits around this issue.

The linked bug report outlines the issue, Daniel Steffen's design decisions, and some proposed fixes, which I later attempted to implement. As I understand it, part of the change in the Tcl branch is removing or substantially modifying my changes. 

You may have a better insight than me into what's going on here and the behavior of the script - please take a look at it and see if you can reproduce my results.

fvogel added on 2020-07-13 09:22:40:
> Could someone add "marc_culler" to the "Assigned to" list? 

Done!

jan.nijtmans added on 2020-07-13 08:41:59:

B.T.W: Could someone add "marc_culler" to the "Assigned to" list?


jan.nijtmans added on 2020-07-13 08:39:10:

Reading the above story and testing the macOS_hangs branch, I would like to accept this for 8.6.11. That doesn't mean it's 100% OK, yet, but I think it's better than what's now on core-8-6-branch. So, merging it now would be an improvement already.

There still are a few testcases failing on Travis sometimes: chan-io-50.2 and chan-io-50.4, io-50.1, io-50.5, io-50.6, event-1.1 were seen failing sometimes. Are those test-cases sensitive to timing? I don't know, it could very well be that the test-case is the problem here. Worth to have a look at before finally merging macOS_fails to core-8-6-branch.

I think there's some work to do for "idle_curiosity" too, but that's stuff for another ticket. Good work!


marc_culler (claiming to be Marc Culler) added on 2020-07-11 14:03:16:
Thanks Kevin!

To test, build both Tcl and Tk using the corresponding branches.  With my
directory setup:
$ rm -rf build
$ cd Tcl
$ fossil pull
$ fossil update macOS_hangs
$ make -C macosx
$ sudo make -C macosx install
$ cd ../Tk
$ fossil pull
$ fossil update idle_curiosity
$ make -C macosx
$ sudo make -C macosx install

We are looking for the usual problems.  We don't want to see hangs or drawing
anomalies; at least not new ones that aren't already present in core-8-6-branch.
This includes temporary hangs, where a window is not completely rendered until
an event such as a mouse click or motion occurs.

Tests could include running the regression tests, running the demo script,
building and testing your apps with this Tcl and Tk, building and testing any
other apps that you have a build setup for.

kevin_walzer added on 2020-07-11 13:30:19:
Marc, happy to test this; how should I combine these two branches and what should I look for in testing?

fvogel added on 2020-07-11 10:13:11:
Marc, I'm not really in a position to test. Nevertheless you have my full vote of confidence.

Attachments: