Tcl Source Code

View Ticket
Login
Ticket UUID: 534172ff5b7f0e64f11a0ae6ba83ca1b359f73c6
Title: Crash in DeleteReflectedChannelMap (introduced via tcllib 1.21)
Type: Bug Version: 6.8.13 and older
Submitter: anonymous Created on: 2023-02-19 19:38:51
Subsystem: 25. Channel System Assigned To: jan.nijtmans
Priority: 5 Medium Severity: Severe
Status: Closed Last Modified: 2023-02-22 08:10:32
Resolution: Fixed Closed By: jan.nijtmans
    Closed on: 2023-02-22 08:10:32
Description:
Crash in DeleteReflectedChannelMap (introduced by tcllib 1.21)

On one of our production sites, we experienced the last days several
crashes of Tcl in DeleteReflectedChannelMap()in the statement after
GetThreadReflectedChannelMap(), where the system is trying to "get the
map of all channels handled by the current thread".

This crash happens on the busy site every few hours. This site has
typically between 500 and 2000 concurrent users. The crash happens
with various Tcl versions, such as 8.6.13 or 8.6.9. After some
digging, I found out that the source of the problem was an upgrade
from tcllib 1.20 to 1.21.

====================================================================================
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f259f1e6859 in __GI_abort () at abort.c:79
#2  0x00007f259f43227f in Panic (fmt=<optimized out>) at log.c:951
#3  0x00007f259ed65d2d in Tcl_PanicVA (format=<optimized out>, argList=argList@entry=0x7f2593b32d50) at /usr/local/src/tcl8.6.9/generic/tclPanic.c:99
#4  0x00007f259ed65e9f in Tcl_Panic (format=format@entry=0x7f259f49aba0 "received fatal signal %d") at /usr/local/src/tcl8.6.9/generic/tclPanic.c:160
#5  0x00007f259f47890a in Abort (signal=<optimized out>) at unix.c:1119
#6  <signal handler called>
#7  Tcl_NextHashEntry (searchPtr=searchPtr@entry=0x7f2593b33b50) at /usr/local/src/tcl8.6.9/generic/tclHash.c:616
#8  0x00007f259ed37f71 in Tcl_FirstHashEntry (tablePtr=tablePtr@entry=0x55725ff9fa90, searchPtr=searchPtr@entry=0x7f2593b33b50)
    at /usr/local/src/tcl8.6.9/generic/tclHash.c:574
#9  0x00007f259ed4cef1 in DeleteReflectedChannelMap (clientData=<optimized out>, interp=0x55717e244010) at /usr/local/src/tcl8.6.9/generic/tclIORChan.c:2600
#10 0x00007f259ec9b26b in DeleteInterpProc (interp=0x55717e244010) at /usr/local/src/tcl8.6.9/generic/tclBasic.c:1488
#11 0x00007f259ed212ae in FinalizeThread (quick=quick@entry=0) at /usr/local/src/tcl8.6.9/generic/tclEvent.c:1302
#12 0x00007f259ed21fdb in Tcl_FinalizeThread () at /usr/local/src/tcl8.6.9/generic/tclEvent.c:1273
#13 0x00007f259ee19f3f in Ns_ThreadExit (arg=arg@entry=0x557176a16030) at thread.c:426
#14 0x00007f259f439825 in NsConnThread (arg=0x557176a16030) at queue.c:2378
#15 0x00007f259ee19fd4 in NsThreadMain (arg=<optimized out>) at thread.c:232
#16 0x00007f259ee1a69f in ThreadMain (arg=<optimized out>) at pthread.c:870
#17 0x00007f259e8cb609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#18 0x00007f259f2e3133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95


#7  Tcl_NextHashEntry (searchPtr=searchPtr@entry=0x7f2593b33b50) at /usr/local/src/tcl8.6.9/generic/tclHash.c:616
616	    searchPtr->nextEntryPtr = hPtr->nextPtr;
(gdb) p searchPtr
$1 = (Tcl_HashSearch *) 0x7f2593b33b50
(gdb) p *searchPtr
$2 = {tablePtr = 0x55725ff9fa90, nextIndex = 1, nextEntryPtr = 0x9ee12aa000005572}
(gdb) p hPtr
$3 = (Tcl_HashEntry *) 0x9ee12aa000005572
(gdb) p *hPtr
Cannot access memory at address 0x9ee12aa000005572


(gdb) up
#8  0x00007f259ed37f71 in Tcl_FirstHashEntry (tablePtr=tablePtr@entry=0x55725ff9fa90, searchPtr=searchPtr@entry=0x7f2593b33b50)
    at /usr/local/src/tcl8.6.9/generic/tclHash.c:574
574	    return Tcl_NextHashEntry(searchPtr);
(gdb) p *searchPtr
$3 = {tablePtr = 0x55725ff9fa90, nextIndex = 1, nextEntryPtr = 0x9ee12aa000005572}


(gdb) up
#9  0x00007f259ed4cef1 in DeleteReflectedChannelMap (clientData=<optimized out>, interp=0x55717e244010) at /usr/local/src/tcl8.6.9/generic/tclIORChan.c:2600
2600	    for (hPtr = Tcl_FirstHashEntry(&rcmPtr->map, &hSearch);
(gdb) p *rcmPtr
$4 = {map = {buckets = 0x557200001004, staticBuckets = {0x55724c79fb40, 0x1, 0x0, 0x5571bebee890}, numBuckets = 401784660, numEntries = 0, 
    rebuildSize = 1610218128, downShift = 21874, mask = 61616064, keyType = 21874, findProc = 0x7f259ed37c20 <FindHashEntry>, 
    createProc = 0x7f259ed37880 <CreateHashEntry>, typePtr = 0x55729723ad10}}

====================================================================================

One can see that the problem happens during shutdown of a thread (in
NaviServer) during the deletion of an interpreter. This instance is
typically running around 80 threads with Tcl interpreters. These
threads are shutdown/restarted every 10k requests to ensure memory
cleanup. In some of these regular thread shutdowns, the crash happens.

I have added a few debug log statements, showing some addresses. 
There is no double-free etc. happening. It looks like a concurrency
issue or maybe related with the thread shutdown order.

====================================================================================
[17/Feb/2023:11:43:08][1364993.7f2593b34700][-conn:production:default:1:9258-] Notice: ON DELETE done (107ms)
DeleteReflectedChannelMap interp 0x55717e244010
DeleteReflectedChannelMap rcmPtr 0x557267a3d010 table 0x557267a3d010
DeleteReflectedChannelMap delete hash table 0x557267a3d010
DeleteReflectedChannelMap GetThreadReflectedChannelMap 0x55725ff9fa90 table 0x55725ff9fa90
[17/Feb/2023:11:43:08][1364993.7f2593b34700][-NONE-] Fatal: received fatal signal 11
====================================================================================

So, how does this relate to the version of tcllib?  This site uses the
tcllib package mime, which uses in 1.21 the memchan package, which in
turn uses reflected channels.

====================================================================================
-package provide mime 1.6.2
+package provide mime 1.7.0
+package require tcl::chan::memchan
...
====================================================================================

Before the upgrade to tcllib 1,21, the site was not using reflected
channels.  With the downgrade to tcllib to 1.20, the problem
disappeared.

Debugging this is not easy, since the operators and users of the site
are not happy about the crashes, and so far, I have not been able to
reproduce the problem with a simple example.
User Comments: jan.nijtmans added on 2023-02-22 08:09:19:

Fixed [c77e351b9c556c05|here]

Thanks for the report, and the clear analysis/patch


anonymous added on 2023-02-21 17:03:26:
Jan many thanks, my quick tests showed that your fix works as well!

I have managed now to get a simple test script for NaviServer (without OpenACS). 
For experimenting, one can reproduce the problem with 5 commands:

# 1) get the install script for NaviServer (it creates a subdirectory)
git clone https://github.com/gustafn/install-ns.git && cd install-ns

# 2) compile and install NaviServer in /usr/local/ns (sources go to /usr/local/src, creates as well a user nsadmin)
sudo version_ns=GIT with_debug_flags=1 with_ns_doc=0 with_postgres=0 version_tcllib=1.21 bash install-ns.sh build

# 3) increase the frequency of thread restarts to trigger the problem earlier
sudo sed -i -e 's/connsperthread *1000/connsperthread 2/' /usr/local/ns/conf/nsd-config.tcl

# 4) install a test script named mimepage.tcl 
echo 'package req mime
   set token [mime::initialize -canonical text/plain -string {hello world}]
   set msg [mime::buildmessage $token]
   mime::finalize $token -subordinates all
   ns_return 200 text/plain $msg' | sudo tee /usr/local/ns/pages/mimepage.tcl >/dev/null

# 5) start NaviServer on http port 8000 in the foreground (see error/crash in console)
sudo nsd_httpport=8000 /usr/local/ns/bin/nsd -f -u nsadmin -g nsadmin -t /usr/local/ns/conf/nsd-config.tcl

Now, use a browser to navigate to http://YOURHOST:8000/mimepage.tcl 
and keep pressing reload ... until the crash happens.

jan.nijtmans added on 2023-02-21 16:13:20:

Many thanks for the explanation, and the patch! I think I understand what's happening.

I think [488c80081af54363|this] should do the trick. It frees the rcmPtr even earlier than your patch does it. But, please, any review/feedback is welcome. I will wait a few days before committing this, so please jump in.

Andreas is the real "reflected channels" expert here!


anonymous (claiming to be gustafn) added on 2023-02-21 09:44:38:
The problem appears to be the order of the callback execution during Tcl_FinalizeThread():

Tcl_FinalizeThread() calls DeleteThreadReflectedChannelMap() which frees the thread-channel map. 
A little bit later, Tcl_Release() calls DeleteInterpProc(), 
which in turn calls DeleteReflectedChannelMap(), 
which tries at the end to access the thread map (which was freed before).

* DeleteThreadReflectedChannelMap is the thread exit handler for reflected channels,
* DeleteReflectedChannelMap is the handler for Tcl_SetAssocData for an interp.

So, in the crash situation, we see the following call pattern:

   Tcl_FinalizeThread
   DeleteThreadReflectedChannelMap ThreadMap 0x7f2d801c8c20
   DeleteThreadReflectedChannelMap FREE ThreadMap 0x7f2d801c8c20
   Tcl_Release 0x7f2d80001230
   ...
   Tcl_Release 0x7f2d80001230
   DeleteReflectedChannelMap interp 0x7f2d80001230 ThreadMap 0x7f2d801c8c20

The following fix is a possible way to avoid the crash. The fix clears the ThreadSpecificData
for the thread-channel map after freeing it.  Since the thread-channel map 
always accessed via GetThreadReflectedChannelMap() it might be recreated again.
Therefore, DeleteReflectedChannelMap() will find an existing, but empty thread-channel map.


--- ../generic/tclIORChan.c-orig	2023-02-21 10:11:06.826441438 +0100
+++ ../generic/tclIORChan.c-fixed	2023-02-21 10:10:02.101654796 +0100
@@ -2795,6 +2795,10 @@
 	Tcl_DeleteHashEntry(hPtr);
     }
     ckfree(rcmPtr);
+    { 
+        ThreadSpecificData *tsdPtr = TCL_TSD_INIT(&dataKey);
+        tsdPtr->rcmPtr=NULL;
+    }
 }
 

 static void

An expert on reflected channels should check, if there might be potential 
problem s due to the "channel forwarding", which I did not try to 
understand in details.

Another way would be to change the deletion order (delete the thread map 
after the deletion of the interps...) but such a change has the potential 
to cause collateral damage.