Tcl Source Code

View Ticket
Login
2020-04-01
11:48
fixes [f583715154] - tclUnixSock.c: introduced ThreadActionProc considering a transfer of async-conn... check-in: ef20726bcd user: dgp tags: core-8-6-branch
2020-03-31
20:04
fixes [f583715154] - tclUnixSock.c: introduced ThreadActionProc considering a transfer of async-conn... check-in: 532506ef71 user: sebres tags: bug-f583715154-v2
16:55 Ticket [f583715154] Fileevent runs in the wrong thread status still Open with 3 other changes artifact: 7c3ec75f9a user: sebres
2019-12-27
13:55 Ticket [815e246806] tcl crash on channel thread transfer and fileevent writable in place status still Pending with 3 other changes artifact: fe0ed4bbcc user: sebres
2019-08-22
10:02 Ticket [e484a8407c] pipe: time-related sporadic hangs in io/chan-io tests status still Open with 3 other changes artifact: 618aa576db user: sebres
2018-06-12
11:42 Ticket [815e246806] tcl crash on channel thread transfer and fileevent writable in place status still Pending with 3 other changes artifact: 51d427ec57 user: sebres
10:14 Ticket [815e246806]: 4 changes artifact: 97ab9bdffb user: sebres
09:52 Ticket [815e246806]: 4 changes artifact: 1891b2b42a user: oehhar
2018-06-11
11:42 Ticket [815e246806]: 3 changes artifact: ab8dfa8eec user: sebres
11:11
fixed [815e246806] for 8.6th case if channel switches owner in fileevent via thread::transfer and th... check-in: 922ee82dde user: sebres tags: fix-815e246806-8-6
2018-06-08
22:51 Ticket [815e246806] tcl crash on channel thread transfer and fileevent writable in place status still Pending with 3 other changes artifact: 7dde2fc044 user: sebres
20:35 Ticket [815e246806]: 3 changes artifact: 0197c57390 user: sebres
18:59
Fixes bug [815e246806] using atomic reference counter (atomic incr/decr during channel transfer proc... check-in: bd0fca47e3 user: sebres tags: fix-815e246806-8-5
17:17 Ticket [815e246806] tcl crash on channel thread transfer and fileevent writable in place status still Pending with 3 other changes artifact: d24f340efb user: sebres
16:13 Ticket [815e246806]: 3 changes artifact: ae97acc89e user: sebres
15:23 Ticket [815e246806]: 3 changes artifact: 0d2b5128d9 user: max
13:45 Pending ticket [815e246806]. artifact: cd466c081d user: sebres
13:43 Ticket [815e246806]: 3 changes artifact: 5e6858ce0c user: sebres
13:33
merge fix for [815e246806] from 8.5th branch fix-815e246806-8-5 check-in: f44a1ddd60 user: sebres tags: fix-815e246806-8-6
13:28
fixes [815e246806], extend Tcl_RegisterChannel/Tcl_UnregisterChannel with preserving of ref-count of... check-in: dafa1a0859 user: sebres tags: fix-815e246806-8-5
11:42 Ticket [815e246806] tcl crash on channel thread transfer and fileevent writable in place status still Open with 4 other changes artifact: fd46bf22bb user: sebres
11:41 Ticket [815e246806]: 3 changes artifact: 5536eecb2d user: sebres
09:01 Ticket [815e246806]: 3 changes artifact: 24f12cb2af user: sebres
09:01 Ticket [815e246806]: 3 changes artifact: 22dd9b895d user: max
08:32 Ticket [815e246806]: 4 changes artifact: 59d4644da6 user: oehhar
08:17 Ticket [815e246806]: 3 changes artifact: b5833e369f user: max
07:34 New ticket [815e246806]. artifact: 5cf3f5cc39 user: oehhar

Ticket UUID: 815e2468064b4324bc4b303b9bafd3ef92803c32
Title: tcl crash on channel thread transfer and fileevent writable in place
Type: Bug Version: >= 8.5 (partially), >= 8.6
Submitter: oehhar Created on: 2018-06-08 07:34:54
Subsystem: 49. Threading Assigned To: sebres
Priority: 5 Medium Severity: Important
Status: Pending Last Modified: 2019-12-27 13:55:50
Resolution: Fixed Closed By: nobody
    Closed on:
Description:

Summary

Transfering a channel to another thread by "thread::transfer" with a "fileevent writable" on the channel crashes tclsh after around 20 tries on Linux and Windows. The crash goes away, if the file event is cleared before the transfer.

Crash

Test script:

package require Thread

set handleproc {
    proc handle {chan} {
        global name other
        if {$name eq "master"} {
            puts $name
        } else {
            thread::send $other "puts slave"
        }
        thread::transfer $other $chan
        thread::send $other [list fileevent $chan writable [list handle $chan]]
    }
}

eval $handleproc
set name master
set other [thread::create]

thread::send $other [list set name slave]
thread::send $other [list set other [thread::id]]
thread::send $other $handleproc

set nulldev [dict get {unix /dev/null windows NUL}  $tcl_platform(platform)]
set chan [open $nulldev RDWR]
handle $chan

vwait forever 

Demonstration that it works without fileevent active

The following script switches the event of before the transfer. It does not crash:

package require Thread

set handleproc {
    proc handle {chan} {
        global name other
        if {$name eq "master"} {
            puts $name
        } else {
            thread::send $other "puts slave"
        }
        thread::transfer $other $chan
        thread::send $other [subst {
            fileevent $chan writable {
                fileevent $chan writable {}
                after 0 handle $chan
            }
        }]
    }
}

eval $handleproc
set name master
set other [thread::create]

thread::send $other [list set name slave]
thread::send $other [list set other [thread::id]]
thread::send $other $handleproc

set nulldev [dict get {unix /dev/null windows NUL}  $tcl_platform(platform)]
set chan [open $nulldev RDWR]
handle $chan

vwait forever 

References

  • Discussion thread on the core list titled: "TCL (Windows) socket driver and sending fd to another Thread" started 2018-06-07
  • Bug report and test script by Reinhard Max
  • cause analysis by Christian Werner

User Comments: sebres added on 2019-12-27 13:55:50:

It looks like this ticket in thread-repo [d481f15516a465b1] may be affected by this issue too.


sebres added on 2018-06-12 11:42:30:

I tried to test the fix without atomic operations (rewritten atomic-defines similar the last section from current branch if no TCL_THREADS declared, so as (++(p)) and (--(p))).

It's not stable as non-atomic (sporadically produces panic "Channel released more than preserved").
So no way without - either atomic or mutex (which IMHO a bit superfluous for incr/decr).

Please find enclosed better (faster) test-case without hand-brake in the console (so it reproduces very fast a segfault or panic in original tcl without the fix, resp. without atomic operations also):


set handleproc {
    proc handle {chan} {
        global name other cntr
        if 0 {;#
            if {$name eq "master"} {
                puts -nonewline M
                if {!([incr cntr] % 100)} {flush stdout}
            } else {
                thread::send $other {puts -nonewline s}
            }
        } else {
            if {!([incr cntr] % 50)} {
                if {$name eq "master"} {
                    puts -nonewline [expr {$cntr % 2500 ? "" : " -- $cntr\n"}]M
                } else {
                    thread::send $other {puts -nonewline s}
                }
            }
        };#
        if 0 {
            thread::transfer $other $chan
        } else {
            thread::detach $chan
            #thread::send $other [list thread::attach $chan]
            thread::send -async $other [list thread::attach $chan]
        }
        thread::send $other [list fileevent $chan writable [list handle $chan]]
    }
}

eval $handleproc set name master set other [thread::create]

thread::send $other [list set name slave] thread::send $other [list set other [thread::id]] thread::send $other $handleproc

set nulldev [dict get {unix /dev/null windows NUL} $tcl_platform(platform)] set chan [open $nulldev RDWR] handle $chan

vwait forever


sebres added on 2018-06-12 10:14:20:
Hi Harald, Thx for test! Still someone with *nix?:)

And the question remains - what we do with atomic builtins?
Does anyone have some special system/platform (older compiler) to test?

oehhar added on 2018-06-12 09:52:43:

Sergey, thank you for your continuous passionous work. The fix works totally for me.

Test details

Here is my windows test on Win10 2017-10 64 bit GER compiled with 32 bit:

  • Download tarbal from [aacd6f9233]
  • Unzip to C:\test\tcl-aacd6f9233
  • Copy tcl8.6.8 thread package to pkgs: c:\test\tcl8.6.8\pkgs\thread2.8.2 -> c:\test\tcl-aacd6f9233\pkgs
  • Start cmd prompt and run VCVARS32.bat of MS-VC6 and "setenv.cmd /RETAIL" of PSDK2003SP1.
  • cd C:\test\tcl-aacd6f9233\win
  • nmake -f Makefile.vc
  • nmake -f Makefile.vc install INSTALLDIR=c:\test\tcl8.6.8_sebres
  • cd c:\test\tcl8.6.8_sebres\bin
  • Put the test script of the initial report to c:\test\testsocketthread.tcl
  • tclsh86t.exe c:\test\testsocketthread.tcl

The crash does not occure after 1 Minuite run time. With original 8.6.8, it occures after 20 iterations.

TCL Test run:

  • nmake -f Makefile.vc test TESTFLAGS="-verbose bte" > testlog_tcl8.6.8_815e246806_2018-06-12.txt 2>&1

compared to tcl 8.6.8 release (http://wiki.tcl.tk/37529), there are no test differences.

Sergey, great work, thank you, Harald


sebres added on 2018-06-11 11:42:08:

Branch fix-815e246806-8-6 is ready now for testing/review.

This fix could theoretacally work without the new handling round abount channel ref-count (also non atomic), but... I'm not sure it works then in all possible constellations. In addition I would like to save this code to be pretty sure if someone calls this public API across two threads a bit other way (as in thread-module).

And just to have atomic operations in tcl for future requirements (I've used it already for windows since I fixed pipes-error [8bd13f07bd], see [ad65e576ee]).

Just this handling (atomic primitives in tclInt.h) is not complete possibly and may be missing for some older gcc/clang versions resp. some special platforms, that are don't have support for builtin atomic (or just miss some header include).
If so, we should extend this on demand (similar way other projects doing this, e. g. we've this in nginx, see ngx_atomic_fetch_add in ngx_atomic.h for example.
Or just to extend it with non-atomic variants of TclAtomicFetch... if totally unsupported.

ATM tested MS-VC (>= 6.0), mingw (gcc >= 4.2), linux (gcc >= 6.2, clang >= 3.8).


sebres added on 2018-06-08 22:51:59:

The case if channel switches owner via "thread::transfer" fixed now in [2e7300065a].

Currently still as interim state, because thread::detach/attach still failed sporadically with "SIGBUS":


Thread 2 "tclsh" received signal SIGBUS, Bus error.
[Switching to Thread 0x7ffff62bc700 (LWP 20721)]
0x00007ffff7b089c3 in TclChannelEventScriptInvoker () from ./libtcl8.6.so

#0 0x00007ffff7b089c3 in TclChannelEventScriptInvoker () from ./libtcl8.6.so #1 0x00007ffff7b0da0f in Tcl_NotifyChannel () from ./libtcl8.6.so #2 0x00007ffff7b69a77 in FileHandlerEventProc () from ./libtcl8.6.so #3 0x00007ffff7b280cf in Tcl_ServiceEvent () from ./libtcl8.6.so #4 0x00007ffff7b283f4 in Tcl_DoOneEvent () from ./libtcl8.6.so #5 0x00007ffff62c05df in ThreadWaitObjCmd () from ./libthread2.9a1.so #6 0x00007ffff7a61827 in TclNRRunCallbacks () from ./libtcl8.6.so #7 0x00007ffff7a63932 in TclEvalEx () from ./libtcl8.6.so #8 0x00007ffff7a63ee3 in Tcl_EvalEx () from ./libtcl8.6.so #9 0x00007ffff62c2502 in NewThread () from ./libthread2.9a1.so #10 0x00007ffff70236ca in start_thread (arg=0x7ffff62bc700) at pthread_create.c:333 #11 0x00007ffff77600af in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105

Stack level 0, frame at 0x7ffff62bbc60: rip = 0x7ffff7b089c3 in TclChannelEventScriptInvoker; saved rip = 0x7ffff7b0da0f called by frame at 0x7ffff62bbcb0 Arglist at 0x7ffff62bbc18, args: Locals at 0x7ffff62bbc18, Previous frame's sp is 0x7ffff62bbc60 Saved registers: rbx at 0x7ffff62bbc38, rbp at 0x7ffff62bbc40, r12 at 0x7ffff62bbc48, r13 at 0x7ffff62bbc50, rip at 0x7ffff62bbc58

The reason is the case of synchronous sending of the attach inside of the running event-handler is more complex as provided solution.

But strange is trying to simulate thread::transfer, using "send -async" by attach, ends into SIGSEGV with almost the same call-stack inside TclChannelEventScriptInvoker.

I doubt it may be something wrong with "detach", because it does definitelly the owner thread.

WiP


sebres added on 2018-06-08 20:35:44:

Too complex to port my tcl-thread-module back (too many dependencies to resolve)...

But just tried to protect the refCount of channel with atomic operations and it solves the crashes for windows (8.5/8.6), but on unix it refuses for some reasons for 8.6 (but work in 8.5).

So tried to add assert within TclChannelEventScriptInvoker to check Tcl_GetCurrentThread() == Tcl_GetChannelThread(chanPtr), and it fails, as expected, with:


    [8F63700]/[AC79700] invoke [0x55cd0c822df0]: handle file5
for

    if (Tcl_GetCurrentThread() != Tcl_GetChannelThread(chanPtr)) {
	fprintf(stderr," [%X]/[%X] invoke [%p]: %s\n", 
		(int)Tcl_GetCurrentThread(),
		(int)Tcl_GetChannelThread(chanPtr), chanPtr,
		Tcl_GetString(esPtr->scriptPtr));
    }
So indeed it hangs still somewhere in notifier/event-handler during the transfer-process (but thread-module wait for transmit-confirmation from other thread!).

Still worse is, that if I replace the transfer with attach/detach pair like below, it changes nothing at all:


    if 0 {
	thread::transfer $other $chan
    } else {
	thread::detach $chan
	thread::send $other [list thread::attach $chan]
    }

This implies the "detach" does not work properly (I mean after detach TclChannelEventScriptInvoker will be called again inside the OLD thread).

Strange is also, that it's absolutely not reproducible with my own thread-module (but still the same tcl-version)...


sebres added on 2018-06-08 17:17:28:
So now it going to the thread-module (tested with 2.7 up to 2.8.1 resp. current trunk).

I had a fixed (with rewritten attach/detach/transfer) version installed where it does not occur.
After change to tcl-core version it's pretty reproducible.

Trying to port bug-fix... WiP.

sebres added on 2018-06-08 16:13:53:
Strange... I cannot reproduce it anymore.
Just to be sure, which version of the tclthread-module do you use?
Or better revision if self-compiled...

max added on 2018-06-08 15:23:41:
Unfortunately I see no change in behavior between 8.6.7 and fix-815e246806-8-6.
I still see get the crashes with the "writable" script from Harald's initial description and also with the "readable + delayed" one from my latest post (after replacing [after idle] by [eval]).

BTW, most invocations (pre and post your patch) result in a plain segfault without further output, but sometimes I get an error message + abort with differing error messages, e.g.:

----
alloc: invalid block: 0x5645d07a60b0: c0 d0
alloc: invalid block: 0x5645d07a60b0: c0 d0
----
TclStackFree: incorrect freePtr (0x55807a654ac0 != 0x55807a654b00). Call out of sequence?
----
TclNRExecuteByteCode: abnormal return at pc 6: stack top -18 < entry stack top 0
TclNRExecuteByteCode execution failure: end stack top < start stack top
----
Reuse of ChannelBuffer! 0x55ffee568d80
Reuse of ChannelBuffer! 0x55ffee568d80
----
Tcl_AppendLimitedToObj called with shared object
----

This looks to me like there is still some severe memory corruption going on.

sebres added on 2018-06-08 13:43:30:

Two new branches "fix-815e246806-8-5" and "fix-815e246806-8-6" provide a fix for the issue, see [f44a1ddd60] for example.
Ceched for bot win and nix - neither the panic nor the crash occur now.
All test-cases also passed on all my platforms.

Further testers and reviewers are welcome.


sebres added on 2018-06-08 11:41:35:

Yep, the reason for it is the pair TclChannelPreserve/TclChannelRelease in the TclChannelEventScriptInvoker.

    TclChannelPreserve((Tcl_Channel)chanPtr);
       Tcl_EvalObjEx(interp, esPtr->scriptPtr, ...); etc...
    TclChannelRelease((Tcl_Channel)chanPtr);
Where the call of TclChannelRelease sporadically takes place after the channel belong to other thread (and because of the synchronous thread::send it could already start the transfer-process back).
In this it may cause a sync-issue on "shared" ref-count.


sebres added on 2018-06-08 09:01:56:
Hmm strange, I can't reproduce it on my own tcl-versions at all (but pretty good reproducible with original core-tcl >= 8.5).
I'll try to find what is different there as regards the fileevents.
Could be possibly fixed via sebres-8-x-event-perf-branch's, that are merged on my own tcl's.
I'll supply a fix if I find the reason.

max added on 2018-06-08 09:01:32:

Here is another variant that shows two things:

  1. It also happens with readable events
  2. It is the setup of the new fileevent before the old handler has returned that triggers it, not the actual triggering of the event in the other thread. In other words, even if the triggering of the new event handler is delayed the crash still happens if the fileevent has been set up before the old handler had returned.

I post the variant that works. The crash happens when [after idle] in the handle proc is removed or replaced by [eval].

set handleproc {
    proc handle {chan} {
	global name other
	puts $name
	read $chan
	thread::transfer $other $chan
	after idle [subst {
	    thread::send $other {
		fileevent $chan readable {handle $chan}
	    }
	}]
	thread::send $other [list after 500 [list puts $chan x]]
    }
}

eval $handleproc
set name master
set other [thread::create]
set chan [open |cat RDWR]
fconfigure $chan -buffering none -blocking 0

thread::send $other [list set name slave]
thread::send $other [list set other [thread::id]]
thread::send $other $handleproc

handle $chan

vwait forever


oehhar added on 2018-06-08 08:32:29:

The error escape observed on windows is in line 1974 in generic/tclIO.c :

void
TclChannelRelease(
    Tcl_Channel chan)
{
    Channel *chanPtr = (Channel *) chan;

    if (chanPtr->refCount == 0) {
--->    Tcl_Panic("Channel released more than preserved");
    }
    if (--chanPtr->refCount) {
        return;
    }
    if (chanPtr->typePtr == NULL) {
        ckfree(chanPtr);
    }
}


max added on 2018-06-08 08:17:01:
The fileevent appears to get cleared automatically when transferring the channel, so there is no need to do that manually before the transfer.

The critical point for the crash seems to be that a new fileevent is being set up in the receiving thread while the event handler in the sending thread is still running. So, if I just warp the last line of the handler proc in my original script in

 after idle [list ...]

to defer setting up the new event trigger until the old handler has returned, the crash is gone.