Tk Source Code

Ticket UUID: 1220692
Title: tk locks up
Type: Bug Version: None
Submitter: davidw Created on: 2005-06-14 19:49:29
Subsystem: 70. Event Loop Assigned To: hobbs
Priority: 5 Medium Severity:
Status: Closed Last Modified: 2005-06-21 20:02:25
Resolution: None Closed By: davidw
Closed on: 2005-06-21 13:02:25
Description:
This script, when run under strace, locks up Tk on Linux:

# loop.tcl

# Run like so:
# strace tclsh loop.tcl
package require Tk

proc screenupdate {num} {
    incr num
    .foo configure -text $num
    after 10 [list screenupdate $num]
}

button .foo
pack .foo

screenupdate 0
-----

Better still, it seems to reproduce the bug I see in a
script that locks up even without strace.

The following script will lock it up without strace.
OK - so it's kind of evil, but it still seems to be
tickling a race condition or something. Note that it's
not just slow - it's locked up and not doing anything.

# loop.tcl

# Run like so (no strace needed):
# tclsh loop.tcl
package require Tk

proc afterburner {} {
    after 10 afterburner
}


proc screenupdate {num} {
    incr num
    .foo configure -text $num
    after 10 [list screenupdate $num]
}

button .foo
pack .foo

screenupdate 0

afterburner
afterburner
afterburner

afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner

afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner

afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner



-------------
User Comments: davidw added on 2005-06-21 20:02:25:

I'm going to go ahead and close this. I'm convinced the
problem lies with Debian's libc. If I discover that not to
be the case, we can always reopen this bug.

dkf added on 2005-06-16 05:24:11:

Reviewing the issues you've found, it does indeed look like
glibc 2.3.2 is subtly bust in the threading department. This
might explain some bizarre hangs of production Java apps on
some of our systems. :-( Some of the comments in that
bugzilla report didn't exactly raise the spirits either, but
then I've long suspected that glibc was not maintained to
the quality levels I'm accustomed to with Tcl...

davidw added on 2005-06-16 05:06:54:

Let's compare libc versions.  Debian sarge has:

Version: 2.3.2.ds1-22

The last entry in /usr/share/doc/libc6/changelog.gz is dated:

2003-09-19  Ulrich Drepper  <[email protected]>

        * elf/Makefile ($(objpfx)librtld.os): Create link
map also for
        final build to ease quality control.

So to me it looks like more and more things point to libc6.
I've started a thread on the Debian list - once we have
confirmation there, we can close this bug.

http://lists.debian.org/debian-glibc/2005/06/msg00134.html

vasiljevic added on 2005-06-16 00:19:53:

Hm.... 

This is the test system (SuSE Linux on a Dell 2.4GHz):

zoran@zvlinux:~/sf/core-8-4-branch/unix> uname -a
Linux zvlinux 2.6.4-52-default #1 Wed Apr 7 02:08:30 UTC 2004 i686 
i686 i386 GNU/Linux

zoran@zvlinux:~/sf/core-8-4-branch/unix> strace -o dummy ./tclsh
% array get tcl_platform
osVersion 2.6.4-52-default byteOrder littleEndian threaded 1 machine 
i686 platform unix os Linux user zoran wordSize 4

% set tcl_patchLevel
8.4.10


This is the test program which exercises timer events,
file events, and inter-thread communication events
(using the Tcl Thread package and the thread::broadcast command):


package req Thread
for {set i 0} {$i < 16} {incr i} {
    thread::create {
        set tack 0
        set fl [open /dev/zero]
        proc afterburner {} {
            if {($::tack % 1000) == 0} {
                puts AB:tack([thread::id])-$::tack
            }
            after 1 afterburner
        }
        proc screenupdate num {
            if {[incr num] % 1000 == 0} {
                puts SU:tick([thread::id])-$num
            }
            thread::broadcast "set ::tack $num"
            after 1 screenupdate $num
        }
        proc readzero {chan} {
            if {($::tack % 1000) == 0} {
                puts RZ:tack([thread::id])-$::tack
            }
            after 1
        }
        fconfigure $fl -blocking 0
        fileevent $fl readable [list readzero $fl]
        afterburner
        screenupdate 1 
        thread::wait
    }
}


If I start this, it will create 16 threads which will chew up lots of CPU
and exercise the notifier *a lot*. This runs happily on my system:

% thread::names
tid0x424b1bb0 tid0x422b0bb0 tid0x420afbb0 tid0x41eaebb0 
tid0x41cadbb0 tid0x41aacbb0 tid0x418abbb0 tid0x416aabb0 
tid0x414a9bb0 tid0x412a8bb0 tid0x410a7bb0 tid0x40ea6bb0 
tid0x40ca5bb0 tid0x40aa4bb0 tid0x408a3bb0 tid0x406a2bb0 
tid0x402392a0

SU:tick(tid0x424b1bb0)-124000
AB:tack(tid0x418abbb0)-124000
SU:tick(tid0x408a3bb0)-124000
RZ:tack(tid0x424b1bb0)-124000
RZ:tack(tid0x418abbb0)-124000
RZ:tack(tid0x40ca5bb0)-124000
RZ:tack(tid0x412a8bb0)-124000
RZ:tack(tid0x40ea6bb0)-124000
AB:tack(tid0x412a8bb0)-124000
AB:tack(tid0x40ea6bb0)-124000
AB:tack(tid0x424b1bb0)-124000
SU:tick(tid0x40aa4bb0)-124000
SU:tick(tid0x412a8bb0)-124000
AB:tack(tid0x40ca5bb0)-124000
SU:tick(tid0x416aabb0)-124000
SU:tick(tid0x40ea6bb0)-124000
SU:tick(tid0x41aacbb0)-124000
SU:tick(tid0x422b0bb0)-124000
SU:tick(tid0x41eaebb0)-124000
AB:tack(tid0x416aabb0)-124000
SU:tick(tid0x414a9bb0)-125000
RZ:tack(tid0x408a3bb0)-125000
RZ:tack(tid0x420afbb0)-125000
SU:tick(tid0x418abbb0)-124000
SU:tick(tid0x40ca5bb0)-125000
RZ:tack(tid0x41cadbb0)-125000
RZ:tack(tid0x414a9bb0)-125000
RZ:tack(tid0x422b0bb0)-125000
RZ:tack(tid0x406a2bb0)-125000
RZ:tack(tid0x41eaebb0)-125000
RZ:tack(tid0x40aa4bb0)-125000
(and so on and so on...)

You can observe the alternating thread IDs and counters.
NOTE: it runs under "strace", swamping my disk and chewing
my CPU:

zoran@zvlinux:/usr/local> ps -efl | grep tclsh
0 S zoran     5979  4125  0  76   0 -   394 wait4  19:04 pts/40   00:00:00 
strace -o dummy ./tclsh
0 S zoran     5980  5979  0  75   0 - 48855 schedu 19:04 pts/40   
00:00:00 ./tclsh

top - 19:11:03 up 42 min,  2 users,  load average: 11.08, 7.83, 4.16

I cannot conclude anything except that this might be
some weird Linux issue on your config. Can you
verify this on some other Linux distribution, or
perhaps on another processor?

I gave it a try under Solaris/Sparc (2.8) and Darwin/PPC (10.4)
and it also works as expected.

(Confused) Zoran

davidw added on 2005-06-16 00:09:57:

This bug looks an awful lot like this:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=108631

although their provided code doesn't hang...

I've got an email into the debian libc maintainers to see if
they have anything to say.

vasiljevic added on 2005-06-15 22:19:04:

Most interesting....

I do not believe this has anything to do with the problem,
but the code around line 755 in unix/tclUnixNotfy.c would be
better written like this:

    while (!tsdPtr->eventReady) {
        Tcl_ConditionWait(&tsdPtr->waitCV, &notifierMutex, timePtr);
    }

instead of 

    if (!tsdPtr->eventReady) {
        Tcl_ConditionWait(&tsdPtr->waitCV, &notifierMutex, timePtr);
    }

This is more defensive and could save us some headaches
down the road.
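
POSIX allows pthread_cond_wait() to return spuriously, so the defensive
pattern is to always re-test the predicate in a loop. A minimal standalone
sketch of that pattern (plain pthreads, not the actual notifier code; the
eventReady flag below merely mirrors the field named above):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int eventReady = 0;

/* Signaller: set the predicate first, then wake the waiter. */
static void *producer(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&lock);
    eventReady = 1;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    pthread_mutex_lock(&lock);
    while (!eventReady) {
        /* A spurious wakeup simply re-tests the predicate and waits again. */
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    puts("event received");
    return 0;
}

(compile with: gcc -pthread)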

The fprintf debugging shows clearly that the notify event
actually never reached the waiter, and I cannot imagine
why, unless there really is a quirk in the libc.
Let me see if I can reproduce this on my machine.

davidw added on 2005-06-15 21:46:42:

The reason why I think it's NPTL is that I can't reproduce
the hang on non-NPTL systems - for instance, my PPC Linux
box, or the exact same code when you set the environment
variable:

export LD_ASSUME_KERNEL=2.4.1

which "turns off" NPTL.

I wouldn't completely rule out a bug in the kernel or libc6,
but before we go that route, let's see if we can isolate it
further so they don't blame it on us.

Here's another data point.  I added some debugging fprintf's
around the pthread_cond_wait and broadcast calls:

*68 is the notifier thread, *80 is the main thread

pthread_cond_wait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
cond_wait returned
pthread_cond_wait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
cond_wait returned
pthread_cond_timedwait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
timedwait returned
pthread_cond_wait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
cond_wait returned
pthread_cond_wait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
cond_wait returned
pthread_cond_timedwait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
timedwait returned
pthread_cond_timedwait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
timedwait returned
pthread_cond_wait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
*HANGS*

So it never wakes up from its cond_wait, even though the
broadcast  happened, supposedly...
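
If anyone wants to probe the cond_wait/broadcast behaviour outside of Tcl
entirely, here is a rough standalone sketch (plain pthreads; this is NOT the
instrumentation above, just an illustration of how one might stress the same
handshake). Two threads ping-pong a counter through a condition variable; on
a correct pthread implementation it runs to completion, while a lost wakeup
deadlocks both threads in pthread_cond_wait - essentially the failure mode
being described here:

#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 1000000L

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static long posted = 0;     /* bumped by the producer */
static long consumed = 0;   /* bumped by the consumer */

static void *producer(void *arg)
{
    (void) arg;
    for (long i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&lock);
        while (consumed != posted) {        /* wait until the last post was consumed */
            pthread_cond_wait(&cond, &lock);
        }
        posted++;
        pthread_cond_broadcast(&cond);      /* wake the consumer */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    for (long i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&lock);
        while (posted == consumed) {        /* wait for a new post */
            pthread_cond_wait(&cond, &lock);
        }
        consumed++;
        if (consumed % 100000 == 0) {
            fprintf(stderr, "consumed %ld\n", consumed);
        }
        pthread_cond_broadcast(&cond);      /* wake the producer */
        pthread_mutex_unlock(&lock);
    }

    pthread_join(t, NULL);
    puts("done");
    return 0;
}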

vasiljevic added on 2005-06-15 21:28:56:

Good (or bad, depending on the perspective) :-)

Thanks to you, we have at least ruled Tk out.
This makes things simpler (less complex code to dig through...)

So, as you say, this is an NPTL thingy?

I do have a RH (Enterprise) Dell box here, so I assume I
could give it a try and (hopefully) observe the same
behaviour...

davidw added on 2005-06-15 21:20:43:

The suggested Tcl-only scripts do not *seem* to cause a
hang, but of course, with something that in other situations
has run for hours with no hangs, we can only ever prove that
it *does* hang, not that it doesn't.

This Tcl-only script does cause a hang rather quickly (with
strace):

set fl [open /dev/zero]

proc afterburner {} {after 10 afterburner}
proc screenupdate num {
    if {[incr num] % 100 == 0} {puts tick-$num}
    after 10 screenupdate $num
}

proc readzero {chan } {
}

fconfigure $fl -blocking 0

fileevent $fl readable [list readzero $fl]


afterburner

screenupdate 1; 

vwait forever

Stack traces are the same.

Local variables:

(gdb) up
#2  0x400ecfcf in Tcl_WaitForEvent (timePtr=0x804c1a8) at
tclUnixNotfy.c:774
774             Tcl_ConditionWait(&tsdPtr->waitCV,
&notifierMutex, myTimePtr);
(gdb) info locals 
filePtr = (FileHandler *) 0x0
fileEvPtr = (FileHandlerEvent *) 0x0
mask = 134541992
myTime = {sec = 0, usec = 0}
waitForFiles = 1
myTimePtr = (Tcl_Time *) 0x0
tsdPtr = (ThreadSpecificData *) 0x804eba8

(gdb) thread 2
[Switching to thread 2 (Thread 1085053872 (LWP 19025))]#0 
0x4021ea27 in select
    () from /lib/tls/libc.so.6
(gdb) up
#1  0x400ed563 in NotifierThreadProc (clientData=0x0) at
tclUnixNotfy.c:993
993             if (select(numFdBits, &readableMask,
&writableMask, &exceptionalMask,
(gdb) info locals 
tsdPtr = (ThreadSpecificData *) 0x0
readableMask = {__fds_bits = {8, 0 <repeats 31 times>}}
writableMask = {__fds_bits = {0 <repeats 32 times>}}
exceptionalMask = {__fds_bits = {0 <repeats 32 times>}}
fds = {3, 4}
i = -1
status = 2049
numFdBits = 6
receivePipe = 3
found = 1
poll = {tv_sec = 0, tv_usec = 0}
timePtr = (struct timeval *) 0x0
buf = "\000"

Hrm...

dkf added on 2005-06-15 19:51:58:

What are all the contents of the local variables (at the
Tcl-library level, of course) in both threads when the
threads deadlock?

vasiljevic added on 2005-06-15 19:51:15:

Yes. I assumed David would understand the *intention*
and I just sketched the thing briefly... The main intention is
to filter out the noise Tk might bring in.

dkf added on 2005-06-15 19:46:31:

I think Zoran means the following script... ;-)

proc afterburner {} {after 10 afterburner}
proc screenupdate num {
    if {[incr num] % 100 == 0} {puts tick-$num}
    after 10 screenupdate $num
}
afterburner;screenupdate 1; vwait forever

vasiljevic added on 2005-06-15 19:27:52:

OK. If I understand this correctly, you could/should have been able
to reproduce such behaviour even without Tk, by just using the
plain Tcl shell:

proc afterburner {} {
    after 10 afterburner
}
proc screenupdate {num} {
    incr num
    if {$num % 100} {
        puts"tick-$num"
    }
    after 10 [list screenupdate $num]
}

The above should emit "tick-1", "tick-2" and so on, about once a second.
Can you try this on your machine (maybe employing the strace trick)
and see if it locks up as well?

davidw added on 2005-06-15 19:05:28:

It's the same thing:

Thread 2 (Thread 1084599216 (LWP 19682)):
#0  0x401f6a27 in select () from /lib/tls/libc.so.6
#1  0x400c7f9b in NotifierThreadProc (clientData=0x0) at
tclUnixNotfy.c:976
#2  0x400f8b63 in start_thread () from /lib/tls/libpthread.so.0
#3  0x401fd18a in clone () from /lib/tls/libc.so.6

Thread 1 (Thread 1076207744 (LWP 19681)):
#0  0x400fb295 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
#1  0x400c6e15 in Tcl_ConditionWait (condPtr=0x8052558,
mutexPtr=0x400e25c0, 
    timePtr=0x0) at tclUnixThrd.c:725
#2  0x400c7a07 in Tcl_WaitForEvent (timePtr=0x0) at
tclUnixNotfy.c:755
#3  0x4009df35 in Tcl_DoOneEvent (flags=-3) at
../generic/tclNotify.c:924
#4  0x40accef4 in Tk_MainLoop ()
    at /home/davidw/workshop/stabletcl/tk/generic/tkEvent.c:1501
#5  0x40098e5d in Tcl_Main (argc=1, argv=0xbffff9b8, 
    appInitProc=0x8048746 <Tcl_AppInit>) at
../generic/tclMain.c:485
#6  0x0804873c in main (argc=2, argv=0xbffff9b4) at
tclAppInit.c:90
#0  0x400fb295 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0

Here are the final syscalls of both threads too, just for fun:

Main thread:

14:03:46.010856 write(4, "\0", 1)       = 1
14:03:46.011039 futex(0x80f7238, FUTEX_WAIT, 10186, NULL) = 0
14:03:46.011719 futex(0x804a880, FUTEX_WAIT, 2, NULL) = 0
14:03:46.011967 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.012093 gettimeofday({1118837026, 12164},
{4294967176, 0}) = 0
14:03:46.012262 gettimeofday({1118837026, 12292},
{4294967176, 0}) = 0
14:03:46.012326 write(4, "\0", 1)       = 1
14:03:46.012384 gettimeofday({1118837026, 12402},
{4294967176, 0}) = 0
14:03:46.012431 clock_gettime(CLOCK_REALTIME, {1118837026,
12452000}) = 0
14:03:46.012477 futex(0x80f7238, FUTEX_WAIT, 10187, {0,
607000}) = -1 ETIMEDOUT (Connection timed out)
14:03:46.014276 write(4, "\0", 1)       = 1
14:03:46.014503 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.014721 gettimeofday({1118837026, 14897},
{4294967176, 0}) = 0
14:03:46.014989 gettimeofday({1118837026, 15053},
{4294967176, 0}) = 0
14:03:46.015325 gettimeofday({1118837026, 15497},
{4294967176, 0}) = 0
14:03:46.015636 gettimeofday({1118837026, 15721},
{4294967176, 0}) = 0
14:03:46.015817 write(5,
"\2\30\4\0\16\0\0\5\2\0\0\0\354\354\354\0", 16) = 16
14:03:46.016247 write(4, "\0", 1)       = 1
14:03:46.016423 futex(0x80f7238, FUTEX_WAIT, 10188, NULL) = 0
14:03:46.016786 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.016845 gettimeofday({1118837026, 16864},
{4294967176, 0}) = 0
14:03:46.016959 write(5,
"5\30\4\0\17\0\0\5\16\0\0\0058\0\34\0F\216\5\0\17\0\0\5"...,
236) = 236
14:03:46.017577 gettimeofday({1118837026, 17616},
{4294967176, 0}) = 0
14:03:46.017659 write(4, "\0", 1)       = 1
14:03:46.043395 gettimeofday({1118837026, 73560},
{4294967176, 0}) = 0
14:03:46.080086 futex(0x804a880, FUTEX_WAKE, 1) = 1
14:03:46.084923 futex(0x80f7228, FUTEX_WAKE, 1) = 1
14:03:46.085127 clock_gettime(CLOCK_REALTIME, {1118837026,
85253000}) = 0
14:03:46.085655 futex(0x80f7228, FUTEX_WAKE, 1) = 0
14:03:46.085793 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.085924 gettimeofday({1118837026, 85983},
{4294967176, 0}) = 0
14:03:46.086069 ioctl(5, FIONREAD, [32]) = 0
14:03:46.086240 read(5,
"\16\0\344\217\16\0\0\5\0\0>\0\0\0\0\0\0\0\0\0\17\0\0\5"...,
32) = 32
14:03:46.086551 gettimeofday({1118837026, 86643},
{4294967176, 0}) = 0
14:03:46.086903 gettimeofday({1118837026, 87058},
{4294967176, 0}) = 0
14:03:46.087353 gettimeofday({1118837026, 87514},
{4294967176, 0}) = 0
14:03:46.087658 write(5,
"\2\30\4\0\16\0\0\5\2\0\0\0\354\354\354\0", 16) = 16
14:03:46.088171 write(4, "\0", 1)       = 1
14:03:46.088569 futex(0x804a880, FUTEX_WAKE, 1) = 1
14:03:46.089297 futex(0x80f7228, FUTEX_WAKE, 1) = 1
14:03:46.089604 futex(0x80f7238, FUTEX_WAIT, 10191, NULL
<unfinished ...>

Notifier thread:


14:03:46.003650 read(3, "\0", 1)        = 1
14:03:46.003698 select(6, [3 5], [], [], {0, 0}) = 0 (Timeout)
14:03:46.003802 futex(0x80f7228, FUTEX_WAIT, 2, NULL) = 0
14:03:46.003931 futex(0x80f7228, FUTEX_WAKE, 1) = 0
14:03:46.003969 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 0
14:03:46.004010 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.005133 futex(0x804a880, FUTEX_WAIT, 2, NULL) = 0
14:03:46.005647 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.005779 read(3, "\0", 1)        = 1
14:03:46.005916 select(6, [3 5], [], [], NULL) = 1 (in [5])
14:03:46.006136 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 1
14:03:46.006446 futex(0x804a880, FUTEX_WAKE, 1) = 1
14:03:46.007490 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.008408 read(3, "\0", 1)        = 1
14:03:46.009190 select(6, [3 5], [], [], NULL) = 1 (in [5])
14:03:46.009451 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 1
14:03:46.009737 futex(0x804a880, FUTEX_WAKE, 1) = 1
14:03:46.010384 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.011252 read(3, "\0", 1)        = 1
14:03:46.011397 select(6, [3 5], [], [], {0, 0}) = 0 (Timeout)
14:03:46.011588 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 1
14:03:46.011846 futex(0x804a880, FUTEX_WAKE, 1) = 1
14:03:46.012547 select(6, [3 5], [], [], NULL) = 1 (in [3])
14:03:46.012698 read(3, "\0", 1)        = 1
14:03:46.012744 select(6, [3 5], [], [], NULL) = 1 (in [3])
14:03:46.014541 read(3, "\0", 1)        = 1
14:03:46.014761 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.016452 read(3, "\0", 1)        = 1
14:03:46.016511 select(6, [3 5], [], [], {0, 0}) = 0 (Timeout)
14:03:46.016637 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 1
14:03:46.016699 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.073447 futex(0x804a880, FUTEX_WAIT, 2, NULL) = 0
14:03:46.080279 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.080327 read(3, "\0", 1)        = 1
14:03:46.080397 select(6, [3 5], [], [], NULL) = 1 (in [5])
14:03:46.080545 futex(0x80f7228, FUTEX_WAIT, 2, NULL) = 0
14:03:46.085179 futex(0x80f7228, FUTEX_WAKE, 1) = 0
14:03:46.085319 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 0
14:03:46.085444 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.088418 futex(0x804a880, FUTEX_WAIT, 2, NULL) = 0
14:03:46.088697 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.088813 read(3, "\0", 1)        = 1
14:03:46.088945 select(6, [3 5], [], [], {0, 0}) = 1 (in
[5], left {0, 0})
14:03:46.089182 futex(0x80f7228, FUTEX_WAIT, 2, NULL) = 0
14:03:46.089390 futex(0x80f7228, FUTEX_WAKE, 1) = 0
14:03:46.089435 select(6, [3], [], [], NULL <unfinished ...>

Seems to be pretty much the same as HEAD.

vasiljevic added on 2005-06-15 18:41:30:

Try checking out the core-8-4-branch, recompile with threads/symbols
and give it another try (supplying the stack traces of all threads at
the time of the hang).

davidw added on 2005-06-15 16:57:42:

Here you are:

Thread 2 (Thread 1085053872 (LWP 14698)):
#0  0x4021ea27 in select () from /lib/tls/libc.so.6
#1  0x400ed563 in NotifierThreadProc (clientData=0x0) at
tclUnixNotfy.c:993
#2  0x40120b63 in start_thread () from /lib/tls/libpthread.so.0
#3  0x4022518a in clone () from /lib/tls/libc.so.6

Thread 1 (Thread 1076371584 (LWP 14697)):
#0  0x40123295 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
#1  0x400ec3cb in Tcl_ConditionWait (condPtr=0x804eec0,
mutexPtr=0x4010ae5c, 
    timePtr=0x0) at tclUnixThrd.c:769
#2  0x400ecfcf in Tcl_WaitForEvent (timePtr=0x804c1a8) at
tclUnixNotfy.c:774
#3  0x400b14ac in Tcl_DoOneEvent (flags=-3) at
../generic/tclNotify.c:930
#4  0x40af6a6b in Tk_MainLoop ()
    at /home/davidw/workshop/tk/generic/tkEvent.c:2019
#5  0x400a80cd in Tcl_Main (argc=1, argv=0xbffff9b8, 
    appInitProc=0x8048746 <Tcl_AppInit>) at
../generic/tclMain.c:636
#6  0x0804873c in main (argc=2, argv=0xbffff9b4) at
tclAppInit.c:87
#0  0x40123295 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0

Interesting enough:

(gdb) thread 2
[Switching to thread 2 (Thread 1085053872 (LWP 14698))]#0 
0x4021ea27 in select
    () from /lib/tls/libc.so.6
(gdb) up
#1  0x400ed563 in NotifierThreadProc (clientData=0x0) at
tclUnixNotfy.c:993
993             if (select(numFdBits, &readableMask,
&writableMask, &exceptionalMask,
(gdb) print timePtr
$1 = (struct timeval *) 0x0

So basically, *both* the select and the condition wait have a null
time parameter and block indefinitely.

By the way, to repeat this, I had to be a bit sneaky:

1. start the tclsh

2. attach an strace to it (which seems to slow it down), or do
whatever else triggers the bug more easily than just waiting
for it to happen naturally

3. detach the strace and attach gdb

The thing that really jumps out at me is this:

#1  0x400ec3cb in Tcl_ConditionWait (condPtr=0x804eec0,
mutexPtr=0x4010ae5c,     timePtr=0x0) at tclUnixThrd.c:769
#2  0x400ecfcf in Tcl_WaitForEvent (timePtr=0x804c1a8) at
tclUnixNotfy.c:774

The timePtr goes to NULL!  That's a recipe to get into the
bad condition where both threads wait forever.  Why's that
happening....  Looks like this code was fiddled recently. 
I'll look at it some more.

vasiljevic added on 2005-06-15 15:58:53:

When you run into this, can you please attach gdb to the
hung-up process and provide:

a. info threads
b. for each thread, the stack backtrace (bt)

When using NPTL, there is only one process in the process list,
not one process per thread (as with LinuxThreads). So just fire up
gdb and say

attach <pid>

You'd need to compile the code with --enable-symbols to get any
meaningful info out of there, of course.

Thanks

davidw added on 2005-06-15 05:52:47:

export LD_ASSUME_KERNEL=2.4.1

seems to make it not use the NPTL thread system, reverting
to an older but apparently more stable way of doing things.

davidw added on 2005-06-15 04:53:47:

Another data point: it seems that x86 Linux uses futexes,
whereas PPC uses another system...

davidw added on 2005-06-15 04:30:07:

My PPC machine, running Debian stable, does *not* seem to
exhibit this problem, or at least not as readily as x86
systems do. That doesn't mean it doesn't exist here, but
if it really isn't present here, perhaps that indicates the
problem actually belongs to the kernel or libc...

davidw added on 2005-06-15 02:54:48:

Here is some info:

% info patch
8.4.9

Running on Debian stable

(davidw) 51 % puts $tk_patchLevel
8.4.9

More info as needed.