Tcl Source Code

View Ticket
Login
Ticket UUID: 893f8cc5db3ba8bd6bed80ce89a0a75c6088a37c
Title: Nested mutexes following TIP 509
Type: Bug Version: main
Submitter: oehhar Created on: 2025-08-22 17:00:47
Subsystem: - New Builtin Commands Assigned To: nobody
Priority: 5 Medium Severity: Minor
Status: Open Last Modified: 2025-08-24 06:52:08
Resolution: None Closed By: nobody
    Closed on:
Description:

As reported by Christian and Sergey, nested mutexes following TIP 509 are broken.

Christian has provided the fix by a core list message titled Re: "[TCLCORE] Review of branch bug-87b69745be" 2025-08-13 17:50 UTC.

This fix is now in commit [cdc8e88f7b], which starts branch [893f8cc5-tip509-nested-mutex].

Here are additional informations given in the thread by Christian

For the record: the Win32 implementation of wait on condition seems broken, too. Due to Win32 critical sections supporting nesting, the logical unlock before the wait on condition *must* leave the critical section as many times as it was entered and restore this afterwards. Otherwise the mutex is locked during the wait, and that violates the contract.

Note, that the used wall clock has issues, and eventually TIP 509:

Now we have the "wait on condition" operation, which in the POSIX sense can be timely limited to an absolute point of time, when it is to time out. Why the heck you might ask? Dude, due to the spuriously wakeupery, when you've raced and won not. This is not reflected in any current implementation of the Tcl wrappers. And the fine problem in POSIX was, that absolute meant wall clock in the early days. Until POSIX condition variables learned about CLOCK_MONOTONIC. Which IMO is the only usable way to operate "waiting on condition". Since wall clockery might do things under the hood and out of control.

Ashok

I agree your fix, along with the system encoding MT issue Sergey logged, is something that needs attention. Whether that means eliminating the recursive mutex or fixing the condition variable code remains to be seen. While I am wary of the former in a non-major release, since the API is new in Tcl 9, it is likely few people will be using it so reverting may be a possibility. As to the time frame, hopefully in a couple of weeks unless someone gets to it sooner.

User Comments: chw added on 2025-08-24 06:52:08:
Oh, I forgot option C):

Make the emulated POSIX variant, that it deadlocks like the native
POSIX and Win32 variants. Such that all three variants error in
the same way and the platform differences are kept minimal.

chw added on 2025-08-24 06:40:17:
Thinking further I see two options:

A) Fix Tcl_Mutex/Tcl_Condition in the true spirit of TIP#509 by using the
emulated (not the native recursive) variant on POSIX with fixes around
"wait on condition". Apply a similar strategy on Win32.

B) Don't worry about recursive mutexes and be happy with the pre TIP509
state of affairs.

Option A) would give up on using PTHREAD_MUTEX_RECURSIVE for the benefit
to have the Tcl_Mutex/Tcl_Condition combination working without implicit
danger of deadlock. Which could as well be achieved for Win32.

chw added on 2025-08-24 05:37:10:
It gets worse every day: now read this fine print quoted from
https://forums.oracle.com/ords/apexds/post/recursive-mutex-and-conditional-wait-4703

  In pthread_mutexattr_settype man page on Solaris 7, it is stated that
  "It is advised that an application should not use a PTHREAD_MUTEX_RECURSIVE mutex
  with condition variables PTHREAD_MUTEX_RECURSIVE because the implicit unlock performed
  for a pthread_cond_wait() or pthread_cond_timedwait() will not actually release the
  mutex (if it had been locked multiple times)...

So the combination of recursive mutex and condition variable is
a classical two component explosive. And my initial thoughts on
trying to fix it for systems which natively do not support
recursive mutexes thus were most likely a complete waste of time.

Now, when we take the quoted statement (which is 25 years old) serious
we should provide both recursive and plain mutexes or ditch recursive
mutexes entirely. I'm in favor of the latter (and am still not sorry
for my notorious noise).

BTW: here is a snippet with which the quoted statement can nicely be
reproduced (it panics) when the snippet is compiled in tcl trunk's ready
built unix subdirectory:

---snip---
#include <tcl.h>

TCL_DECLARE_MUTEX(mutex);
Tcl_Condition cond;
static int deadlock;
static int stopit;
static int state;

static Tcl_ThreadCreateType
WatchDog(void *arg)
{
    int n = 0;

    while (!stopit && n < 20) {
        Tcl_Sleep(100);
        ++n;
    }
    if (deadlock) {
        Tcl_Panic("Deadlock detected");
    }
    TCL_THREAD_CREATE_RETURN;
}

static Tcl_ThreadCreateType
StrayCat(void *arg)
{
    while (!stopit) {
        Tcl_MutexLock(&mutex);
        if (state == 0) {
            state = 1;
            Tcl_ConditionNotify(&cond);
        } else if (state == 1) {
            state = 2;
            Tcl_ConditionNotify(&cond);
        }
        Tcl_MutexUnlock(&mutex);
        Tcl_Sleep(100);
    }
    TCL_THREAD_CREATE_RETURN;
}

static void
TestFunc(void)
{
    Tcl_ThreadId tid1, tid2;
    int dummy;

    deadlock = 1;
    stopit = 0;
    state = 0;
    Tcl_CreateThread(&tid1, WatchDog, NULL,
            TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
    Tcl_MutexLock(&mutex);
    Tcl_CreateThread(&tid1, StrayCat, NULL,
            TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
    while (state != 1) {
        Tcl_ConditionWait(&cond, &mutex, NULL);
    }
    Tcl_MutexLock(&mutex);
    while (state != 2) {
        Tcl_ConditionWait(&cond, &mutex, NULL);
    }
    Tcl_MutexUnlock(&mutex);
    Tcl_MutexUnlock(&mutex);
    stopit = 1;
    Tcl_JoinThread(tid2, &dummy);
    deadlock = 0;
    Tcl_JoinThread(tid1, &dummy);
}

int
main(int argc, char **argv)
{
    TestFunc();
    return 0;
}

/*
 * Local Variables:
 * mode: c
 * c-basic-offset: 4
 * fill-column: 78
 * compile-command: "cc test.c -I../generic -I. -L. -ltcl9.1"
 * End:
 */

chw added on 2025-08-23 05:01:04:
See attached untested POC patch to fix this on Win32, too.

Attachments: