Ticket UUID: | 534172ff5b7f0e64f11a0ae6ba83ca1b359f73c6 | |||
Title: | Crash in DeleteReflectedChannelMap (introduced via tcllib 1.21) | |||
Type: | Bug | Version: | 6.8.13 and older | |
Submitter: | anonymous | Created on: | 2023-02-19 19:38:51 | |
Subsystem: | 25. Channel System | Assigned To: | jan.nijtmans | |
Priority: | 5 Medium | Severity: | Severe | |
Status: | Closed | Last Modified: | 2023-02-22 08:10:32 | |
Resolution: | Fixed | Closed By: | jan.nijtmans | |
Closed on: | 2023-02-22 08:10:32 | |||
Description: |
Crash in DeleteReflectedChannelMap (introduced by tcllib 1.21) On one of our production sites, we experienced the last days several crashes of Tcl in DeleteReflectedChannelMap()in the statement after GetThreadReflectedChannelMap(), where the system is trying to "get the map of all channels handled by the current thread". This crash happens on the busy site every few hours. This site has typically between 500 and 2000 concurrent users. The crash happens with various Tcl versions, such as 8.6.13 or 8.6.9. After some digging, I found out that the source of the problem was an upgrade from tcllib 1.20 to 1.21. ==================================================================================== #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007f259f1e6859 in __GI_abort () at abort.c:79 #2 0x00007f259f43227f in Panic (fmt=<optimized out>) at log.c:951 #3 0x00007f259ed65d2d in Tcl_PanicVA (format=<optimized out>, argList=argList@entry=0x7f2593b32d50) at /usr/local/src/tcl8.6.9/generic/tclPanic.c:99 #4 0x00007f259ed65e9f in Tcl_Panic (format=format@entry=0x7f259f49aba0 "received fatal signal %d") at /usr/local/src/tcl8.6.9/generic/tclPanic.c:160 #5 0x00007f259f47890a in Abort (signal=<optimized out>) at unix.c:1119 #6 <signal handler called> #7 Tcl_NextHashEntry (searchPtr=searchPtr@entry=0x7f2593b33b50) at /usr/local/src/tcl8.6.9/generic/tclHash.c:616 #8 0x00007f259ed37f71 in Tcl_FirstHashEntry (tablePtr=tablePtr@entry=0x55725ff9fa90, searchPtr=searchPtr@entry=0x7f2593b33b50) at /usr/local/src/tcl8.6.9/generic/tclHash.c:574 #9 0x00007f259ed4cef1 in DeleteReflectedChannelMap (clientData=<optimized out>, interp=0x55717e244010) at /usr/local/src/tcl8.6.9/generic/tclIORChan.c:2600 #10 0x00007f259ec9b26b in DeleteInterpProc (interp=0x55717e244010) at /usr/local/src/tcl8.6.9/generic/tclBasic.c:1488 #11 0x00007f259ed212ae in FinalizeThread (quick=quick@entry=0) at /usr/local/src/tcl8.6.9/generic/tclEvent.c:1302 #12 0x00007f259ed21fdb in Tcl_FinalizeThread () at /usr/local/src/tcl8.6.9/generic/tclEvent.c:1273 #13 0x00007f259ee19f3f in Ns_ThreadExit (arg=arg@entry=0x557176a16030) at thread.c:426 #14 0x00007f259f439825 in NsConnThread (arg=0x557176a16030) at queue.c:2378 #15 0x00007f259ee19fd4 in NsThreadMain (arg=<optimized out>) at thread.c:232 #16 0x00007f259ee1a69f in ThreadMain (arg=<optimized out>) at pthread.c:870 #17 0x00007f259e8cb609 in start_thread (arg=<optimized out>) at pthread_create.c:477 #18 0x00007f259f2e3133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 #7 Tcl_NextHashEntry (searchPtr=searchPtr@entry=0x7f2593b33b50) at /usr/local/src/tcl8.6.9/generic/tclHash.c:616 616 searchPtr->nextEntryPtr = hPtr->nextPtr; (gdb) p searchPtr $1 = (Tcl_HashSearch *) 0x7f2593b33b50 (gdb) p *searchPtr $2 = {tablePtr = 0x55725ff9fa90, nextIndex = 1, nextEntryPtr = 0x9ee12aa000005572} (gdb) p hPtr $3 = (Tcl_HashEntry *) 0x9ee12aa000005572 (gdb) p *hPtr Cannot access memory at address 0x9ee12aa000005572 (gdb) up #8 0x00007f259ed37f71 in Tcl_FirstHashEntry (tablePtr=tablePtr@entry=0x55725ff9fa90, searchPtr=searchPtr@entry=0x7f2593b33b50) at /usr/local/src/tcl8.6.9/generic/tclHash.c:574 574 return Tcl_NextHashEntry(searchPtr); (gdb) p *searchPtr $3 = {tablePtr = 0x55725ff9fa90, nextIndex = 1, nextEntryPtr = 0x9ee12aa000005572} (gdb) up #9 0x00007f259ed4cef1 in DeleteReflectedChannelMap (clientData=<optimized out>, interp=0x55717e244010) at /usr/local/src/tcl8.6.9/generic/tclIORChan.c:2600 2600 for (hPtr = Tcl_FirstHashEntry(&rcmPtr->map, &hSearch); (gdb) p *rcmPtr $4 = {map = {buckets = 0x557200001004, staticBuckets = {0x55724c79fb40, 0x1, 0x0, 0x5571bebee890}, numBuckets = 401784660, numEntries = 0, rebuildSize = 1610218128, downShift = 21874, mask = 61616064, keyType = 21874, findProc = 0x7f259ed37c20 <FindHashEntry>, createProc = 0x7f259ed37880 <CreateHashEntry>, typePtr = 0x55729723ad10}} ==================================================================================== One can see that the problem happens during shutdown of a thread (in NaviServer) during the deletion of an interpreter. This instance is typically running around 80 threads with Tcl interpreters. These threads are shutdown/restarted every 10k requests to ensure memory cleanup. In some of these regular thread shutdowns, the crash happens. I have added a few debug log statements, showing some addresses. There is no double-free etc. happening. It looks like a concurrency issue or maybe related with the thread shutdown order. ==================================================================================== [17/Feb/2023:11:43:08][1364993.7f2593b34700][-conn:production:default:1:9258-] Notice: ON DELETE done (107ms) DeleteReflectedChannelMap interp 0x55717e244010 DeleteReflectedChannelMap rcmPtr 0x557267a3d010 table 0x557267a3d010 DeleteReflectedChannelMap delete hash table 0x557267a3d010 DeleteReflectedChannelMap GetThreadReflectedChannelMap 0x55725ff9fa90 table 0x55725ff9fa90 [17/Feb/2023:11:43:08][1364993.7f2593b34700][-NONE-] Fatal: received fatal signal 11 ==================================================================================== So, how does this relate to the version of tcllib? This site uses the tcllib package mime, which uses in 1.21 the memchan package, which in turn uses reflected channels. ==================================================================================== -package provide mime 1.6.2 +package provide mime 1.7.0 +package require tcl::chan::memchan ... ==================================================================================== Before the upgrade to tcllib 1,21, the site was not using reflected channels. With the downgrade to tcllib to 1.20, the problem disappeared. Debugging this is not easy, since the operators and users of the site are not happy about the crashes, and so far, I have not been able to reproduce the problem with a simple example. | |||
User Comments: |
jan.nijtmans added on 2023-02-22 08:09:19:
Fixed [c77e351b9c556c05|here] Thanks for the report, and the clear analysis/patch anonymous added on 2023-02-21 17:03:26: Jan many thanks, my quick tests showed that your fix works as well! I have managed now to get a simple test script for NaviServer (without OpenACS). For experimenting, one can reproduce the problem with 5 commands: # 1) get the install script for NaviServer (it creates a subdirectory) git clone https://github.com/gustafn/install-ns.git && cd install-ns # 2) compile and install NaviServer in /usr/local/ns (sources go to /usr/local/src, creates as well a user nsadmin) sudo version_ns=GIT with_debug_flags=1 with_ns_doc=0 with_postgres=0 version_tcllib=1.21 bash install-ns.sh build # 3) increase the frequency of thread restarts to trigger the problem earlier sudo sed -i -e 's/connsperthread *1000/connsperthread 2/' /usr/local/ns/conf/nsd-config.tcl # 4) install a test script named mimepage.tcl echo 'package req mime set token [mime::initialize -canonical text/plain -string {hello world}] set msg [mime::buildmessage $token] mime::finalize $token -subordinates all ns_return 200 text/plain $msg' | sudo tee /usr/local/ns/pages/mimepage.tcl >/dev/null # 5) start NaviServer on http port 8000 in the foreground (see error/crash in console) sudo nsd_httpport=8000 /usr/local/ns/bin/nsd -f -u nsadmin -g nsadmin -t /usr/local/ns/conf/nsd-config.tcl Now, use a browser to navigate to http://YOURHOST:8000/mimepage.tcl and keep pressing reload ... until the crash happens. jan.nijtmans added on 2023-02-21 16:13:20: Many thanks for the explanation, and the patch! I think I understand what's happening. I think [488c80081af54363|this] should do the trick. It frees the rcmPtr even earlier than your patch does it. But, please, any review/feedback is welcome. I will wait a few days before committing this, so please jump in. Andreas is the real "reflected channels" expert here! anonymous (claiming to be gustafn) added on 2023-02-21 09:44:38: The problem appears to be the order of the callback execution during Tcl_FinalizeThread(): Tcl_FinalizeThread() calls DeleteThreadReflectedChannelMap() which frees the thread-channel map. A little bit later, Tcl_Release() calls DeleteInterpProc(), which in turn calls DeleteReflectedChannelMap(), which tries at the end to access the thread map (which was freed before). * DeleteThreadReflectedChannelMap is the thread exit handler for reflected channels, * DeleteReflectedChannelMap is the handler for Tcl_SetAssocData for an interp. So, in the crash situation, we see the following call pattern: Tcl_FinalizeThread DeleteThreadReflectedChannelMap ThreadMap 0x7f2d801c8c20 DeleteThreadReflectedChannelMap FREE ThreadMap 0x7f2d801c8c20 Tcl_Release 0x7f2d80001230 ... Tcl_Release 0x7f2d80001230 DeleteReflectedChannelMap interp 0x7f2d80001230 ThreadMap 0x7f2d801c8c20 The following fix is a possible way to avoid the crash. The fix clears the ThreadSpecificData for the thread-channel map after freeing it. Since the thread-channel map always accessed via GetThreadReflectedChannelMap() it might be recreated again. Therefore, DeleteReflectedChannelMap() will find an existing, but empty thread-channel map. --- ../generic/tclIORChan.c-orig 2023-02-21 10:11:06.826441438 +0100 +++ ../generic/tclIORChan.c-fixed 2023-02-21 10:10:02.101654796 +0100 @@ -2795,6 +2795,10 @@ Tcl_DeleteHashEntry(hPtr); } ckfree(rcmPtr); + { + ThreadSpecificData *tsdPtr = TCL_TSD_INIT(&dataKey); + tsdPtr->rcmPtr=NULL; + } } static void An expert on reflected channels should check, if there might be potential problem s due to the "channel forwarding", which I did not try to understand in details. Another way would be to change the deletion order (delete the thread map after the deletion of the interps...) but such a change has the potential to cause collateral damage. |