Ticket UUID: | 893f8cc5db3ba8bd6bed80ce89a0a75c6088a37c | |||
Title: | Nested mutexes following TIP 509 | |||
Type: | Bug | Version: | main | |
Submitter: | oehhar | Created on: | 2025-08-22 17:00:47 | |
Subsystem: | - New Builtin Commands | Assigned To: | nobody | |
Priority: | 5 Medium | Severity: | Minor | |
Status: | Open | Last Modified: | 2025-08-24 06:52:08 | |
Resolution: | None | Closed By: | nobody | |
Closed on: | ||||
Description: |
As reported by Christian and Sergey, nested mutexes following TIP 509 are broken. Christian has provided the fix by a core list message titled Re: "[TCLCORE] Review of branch bug-87b69745be" 2025-08-13 17:50 UTC. This fix is now in commit [cdc8e88f7b], which starts branch [893f8cc5-tip509-nested-mutex]. Here are additional informations given in the thread by ChristianFor the record: the Win32 implementation of wait on condition seems broken, too. Due to Win32 critical sections supporting nesting, the logical unlock before the wait on condition *must* leave the critical section as many times as it was entered and restore this afterwards. Otherwise the mutex is locked during the wait, and that violates the contract. Note, that the used wall clock has issues, and eventually TIP 509: Now we have the "wait on condition" operation, which in the POSIX sense can be timely limited to an absolute point of time, when it is to time out. Why the heck you might ask? Dude, due to the spuriously wakeupery, when you've raced and won not. This is not reflected in any current implementation of the Tcl wrappers. And the fine problem in POSIX was, that absolute meant wall clock in the early days. Until POSIX condition variables learned about CLOCK_MONOTONIC. Which IMO is the only usable way to operate "waiting on condition". Since wall clockery might do things under the hood and out of control. AshokI agree your fix, along with the system encoding MT issue Sergey logged, is something that needs attention. Whether that means eliminating the recursive mutex or fixing the condition variable code remains to be seen. While I am wary of the former in a non-major release, since the API is new in Tcl 9, it is likely few people will be using it so reverting may be a possibility. As to the time frame, hopefully in a couple of weeks unless someone gets to it sooner. | |||
User Comments: |
chw added on 2025-08-24 06:52:08:
Oh, I forgot option C): Make the emulated POSIX variant, that it deadlocks like the native POSIX and Win32 variants. Such that all three variants error in the same way and the platform differences are kept minimal. chw added on 2025-08-24 06:40:17: Thinking further I see two options: A) Fix Tcl_Mutex/Tcl_Condition in the true spirit of TIP#509 by using the emulated (not the native recursive) variant on POSIX with fixes around "wait on condition". Apply a similar strategy on Win32. B) Don't worry about recursive mutexes and be happy with the pre TIP509 state of affairs. Option A) would give up on using PTHREAD_MUTEX_RECURSIVE for the benefit to have the Tcl_Mutex/Tcl_Condition combination working without implicit danger of deadlock. Which could as well be achieved for Win32. chw added on 2025-08-24 05:37:10: It gets worse every day: now read this fine print quoted from https://forums.oracle.com/ords/apexds/post/recursive-mutex-and-conditional-wait-4703 In pthread_mutexattr_settype man page on Solaris 7, it is stated that "It is advised that an application should not use a PTHREAD_MUTEX_RECURSIVE mutex with condition variables PTHREAD_MUTEX_RECURSIVE because the implicit unlock performed for a pthread_cond_wait() or pthread_cond_timedwait() will not actually release the mutex (if it had been locked multiple times)... So the combination of recursive mutex and condition variable is a classical two component explosive. And my initial thoughts on trying to fix it for systems which natively do not support recursive mutexes thus were most likely a complete waste of time. Now, when we take the quoted statement (which is 25 years old) serious we should provide both recursive and plain mutexes or ditch recursive mutexes entirely. I'm in favor of the latter (and am still not sorry for my notorious noise). BTW: here is a snippet with which the quoted statement can nicely be reproduced (it panics) when the snippet is compiled in tcl trunk's ready built unix subdirectory: ---snip--- #include <tcl.h> TCL_DECLARE_MUTEX(mutex); Tcl_Condition cond; static int deadlock; static int stopit; static int state; static Tcl_ThreadCreateType WatchDog(void *arg) { int n = 0; while (!stopit && n < 20) { Tcl_Sleep(100); ++n; } if (deadlock) { Tcl_Panic("Deadlock detected"); } TCL_THREAD_CREATE_RETURN; } static Tcl_ThreadCreateType StrayCat(void *arg) { while (!stopit) { Tcl_MutexLock(&mutex); if (state == 0) { state = 1; Tcl_ConditionNotify(&cond); } else if (state == 1) { state = 2; Tcl_ConditionNotify(&cond); } Tcl_MutexUnlock(&mutex); Tcl_Sleep(100); } TCL_THREAD_CREATE_RETURN; } static void TestFunc(void) { Tcl_ThreadId tid1, tid2; int dummy; deadlock = 1; stopit = 0; state = 0; Tcl_CreateThread(&tid1, WatchDog, NULL, TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE); Tcl_MutexLock(&mutex); Tcl_CreateThread(&tid1, StrayCat, NULL, TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE); while (state != 1) { Tcl_ConditionWait(&cond, &mutex, NULL); } Tcl_MutexLock(&mutex); while (state != 2) { Tcl_ConditionWait(&cond, &mutex, NULL); } Tcl_MutexUnlock(&mutex); Tcl_MutexUnlock(&mutex); stopit = 1; Tcl_JoinThread(tid2, &dummy); deadlock = 0; Tcl_JoinThread(tid1, &dummy); } int main(int argc, char **argv) { TestFunc(); return 0; } /* * Local Variables: * mode: c * c-basic-offset: 4 * fill-column: 78 * compile-command: "cc test.c -I../generic -I. -L. -ltcl9.1" * End: */ chw added on 2025-08-23 05:01:04: See attached untested POC patch to fix this on Win32, too. |
Attachments:
- POC.diff [download] added by chw on 2025-08-23 05:01:29. [details]