Tcl Source Code

View Ticket
Login
Ticket UUID: 79474c58800cdf94559a2c01230b4536aa416016
Title: refchan segfault
Type: Bug Version: 8.6.9
Submitter: mjanssen Created on: 2019-07-31 13:22:05
Subsystem: 25. Channel System Assigned To: apnadkarni
Priority: 5 Medium Severity: Important
Status: Closed Last Modified: 2024-05-21 10:35:31
Resolution: Fixed Closed By: sebres
    Closed on: 2024-05-21 10:35:31
Description:

The following script causes a segfault in DoRead:

proc nexec {cmd in out stderr} {
    lassign [chan pipe] in1 out1
    lassign [chan pipe] in2 out2
    lassign [chan pipe] in3 out3
    puts [exec {*}$cmd >@ $out2 2>@ $out3 <@ $in1 &] 
    fileevent $in2 readable [list chan copy $in2 $out]
    fileevent $in3 readable [list chan copy $in3 $stderr]
    fileevent $out1 writable [list chan copy $in $out1]
}

proc test {args} {
  puts $args 
  set rest [lassign $args mode chan]
  switch -exact $mode {
    read {puts $chan "Test" ; close $chan}
    initialize {return "initialize watch finalize read write"}
    finalize {}
  }
}

set c [chan create "read write" test]

if {$::tcl_platform(platform) ne "windows"} {
  nexec {sh -c "echo test"} $c $c $c
} else {
  nexec {cmd /c "echo test"} $c $c $c
}

vwait forever

Gives with a debug build:

% source exec-test.tcl
initialize rc0 {read write}
36514
read rc0 4096
write rc0 {Test
}
finalize rc0
Assertion failed: (bufPtr != NULL), function DoRead, file /Users/mpcjanssen/Src/tcltk/tcl/generic/tclIO.c, line 9815.

User Comments: sebres added on 2024-05-21 10:35:31:
merged, so closed now.

jan.nijtmans added on 2024-05-21 10:22:20:

You'll notice that core-8-branch is currently in a non-stable state, that's why the latest change is not merged yet to trunk. This creates a problem for you, when trying to merge your change to trunk: it will introduce the same problem in trunk.

Therefore, I suggest to "cherry-pick" your fix to trunk. Hopefully dkf will solve his problem soon, but that shouldn't stop you.


sebres added on 2024-05-21 09:03:16:

All tests have passed too... So I'll merge it to mainline branches.


apnadkarni added on 2024-05-19 07:54:42:
Sergey, reviewed and tested on linux with valgrind, windows with iocp and twapi tls. Looks good to me to be merged.

sebres added on 2024-05-17 13:08:22:

I integrated it in [f41c1304d5f124fe]... and added more clean-ups for other members that look safe and, I can imagine, they may be set after the close, asynchronously via some callbacks or handlers (e. g. synthetic timer scheduled by UpdateInterest, etc).

Thanks for co-working, Ashok.

@everyone, I'll leave it as it is for CI (added a tag to test)...
Thx for review and test.


apnadkarni added on 2024-05-17 12:41:58:
Will leave it to you then. My fix is in apn-channelstate-leak. I preferred not to touch other fields for the reasons I mentioned in the comment. If you are confident they are appropriately null-ed when freed, then yes, it would be good to add them.

Thanks for fixing this vexing crash.

/Ashok

sebres added on 2024-05-17 12:14:22:

Yes I already found it too... Checking now whether other members of statePtrr (besides chanMsg) can be affected in similar way.


apnadkarni added on 2024-05-17 11:29:10:
Found the new reported leak. I think. As expected, not caused by your fixes. ChannelState gets freed via Tcl_EventuallyFreed(.., TCL_DYNAMIC) but that will not free ChannelState.{chanMsg,unreportedMsg}

Only ran into this because of the error you mentioned in the write handler in the new test case.

Testing now.

sebres added on 2024-05-17 10:22:42:

Yep, and trying to find this one leak, I found that a write handler in test reflected command was incorrect - this didn't return an integer, thus produced an after effect error {expected integer but got ""} (that probably additionally caused the leak)...

However fixing the test induced further segfault (and 2nd assert(bufPtr) throws then too).

Commit [2c48f0c544ad9ddd] fixes this segfault now.

I don't know whether it'd circumvent the leak (it was resObj marshalled from an error, probably because of wrong write return code), I rewrote the test in [0dac722930816ea3] to cover both variants (without and with an error in reflected write), although both variants seem to segfault without the fix.

But I also think that this leak has nothing to do with the problem or the fix.


apnadkarni added on 2024-05-17 07:59:10:

Not reviewed yet but there is still a leak - looks like a different one though, and again not related to your changes but exposed by them, I think.

Only the tail of the log is below as it is quite large. Let me know if it would help to have the full log.

repository:   d:\src\repos\tcl.fossil
local-root:   c:/src/tcltk/wip/tcl/
config-db:    C:/Users/apnad/AppData/Local/_fossil
checkout:     47cb98a686c4276bab91492c7a31265ae9be56d6 2024-05-16 20:28:04 UTC
...
==5186== 6,440 (48 direct, 6,392 indirect) bytes in 1 blocks are definitely lost in loss record 113 of 113
==5186==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==5186==    by 0x488CB64: TclpAlloc (tclAlloc.c:699)
==5186==    by 0x48A7273: Tcl_Alloc (tclCkalloc.c:1059)
==5186==    by 0x49D7CEA: Tcl_DuplicateObj (tclObj.c:1587)
==5186==    by 0x49F4AE2: Tcl_GetReturnOptions (tclResult.c:1588)
==5186==    by 0x49B6DC9: MarshallError (tclIORChan.c:997)
==5186==    by 0x49B8E4B: InvokeTclMethod (tclIORChan.c:2380)
==5186==    by 0x49B7566: ReflectInput (tclIORChan.c:1331)
==5186==    by 0x49A3E66: ChanRead (tclIO.c:425)
==5186==    by 0x49ABE48: GetInput (tclIO.c:6941)
==5186==    by 0x49B09C8: DoRead (tclIO.c:9955)
==5186==    by 0x49AFE46: CopyData (tclIO.c:9627)
==5186==    by 0x49AF35E: TclCopyChannel (tclIO.c:9294)
==5186==    by 0x49B58BF: Tcl_FcopyObjCmd (tclIOCmd.c:1719)
==5186==    by 0x4899156: Dispatch (tclBasic.c:4503)
==5186==    by 0x48991E0: TclNRRunCallbacks (tclBasic.c:4539)
==5186==    by 0x489BECD: TclEvalObjEx (tclBasic.c:6104)
==5186==    by 0x489BE5D: Tcl_EvalObjEx (tclBasic.c:6085)
==5186==    by 0x49AECAD: TclChannelEventScriptInvoker (tclIO.c:8996)
==5186==    by 0x49AE27A: Tcl_NotifyChannel (tclIO.c:8455)
==5186==    by 0x4A24AE6: FileWatchNotifyChannelWrapper (tclUnixChan.c:491)
==5186==    by 0x4A3233E: FileHandlerEventProc (tclUnixNotfy.c:808)
==5186==    by 0x49D6B71: Tcl_ServiceEvent (tclNotify.c:670)
==5186==    by 0x49D6FC1: Tcl_DoOneEvent (tclNotify.c:967)
==5186==
==5186== LEAK SUMMARY:
==5186==    definitely lost: 48 bytes in 1 blocks
==5186==    indirectly lost: 6,392 bytes in 129 blocks
==5186==      possibly lost: 0 bytes in 0 blocks
==5186==    still reachable: 0 bytes in 0 blocks
==5186==         suppressed: 0 bytes in 0 blocks

sebres added on 2024-05-16 20:32:31:

Commit [47cb98a686c4276b] should fix the leak.

Tests and reviews are welcome (Ashok, could you please repeat your valgrind test again).


sebres added on 2024-05-16 19:19:10:

OK, it looks like a name of reflected channel itself (the object is rcPtr->name, and after ref-decrement in FreeReflectedChannel its refCount is still 2, so has been not removed and hold the statePtr of channel within).

The issue is therefore not related to this ticket and/or fix...
It is simply a weird constellation: object containing the channel name (rc0) is referenced in reflected channel itself, and if the channel gets closed as long as its state is preserved several times (so FreeReflectedChannel doesn't remove it), a typical cyclic reference gets created:

  • rcPtr->name (Tcl_Obj of channel type) holds the ref to statePtr,
  • statePtr holds the ref to channel,
  • chanPtr holds the ref to reflected (rcPtr via chanPtr->instanceData),
  • rcPtr holds the ref to rcPtr->name (and cycle got closed).

I'll try to solve that somehow as part of this fix.


sebres added on 2024-05-16 18:24:33:

It looks like 1 object (03A09940) containing rc0 ($ch) doesn't get decremented, so still hold a reference to statePtr (and therefore it is not released yet), whereas another object (03A09B08), also containing rc0, is removed.

Here is a debug excerpt explaining that (3x preserve / 2x release of statePtr 03D933F0):

  ***==st allocat: 03D933F0 / Tcl_CreateChannel
  ***++st preserv: 03D933F0 / TclGetChannelFromObj === 03A09B08: rc0 ==
  ***++st preserv: 03D933F0 / TclGetChannelFromObj === 03A09940: rc0 ==
  ***++st preserv: 03D933F0 / Tcl_UnregisterChannel
  ***--st release: 03D933F0 / Tcl_UnregisterChannel
  ***--st release: 03D933F0 / FreeChannelInternalRep === 03A09B08

Maybe it is something like ScriptRecord (script of chan event as list containing $ch as apply parameter) or some interim duplication (which then becomes a leak). However related the whole profiler output, all CreateScriptRecord are paired by DeleteScriptRecord.


sebres added on 2024-05-16 15:18:39:

Hmm, indeed... I see 3 times Tcl_Preserve and only 2 times Tcl_Release for first statePtr (would correspond chan create from test)...
(but there are definitely others - e. g. 2 times Tcl_Preserve and only 1 time Tcl_Release for the second statePtr).

I'm also unsure it'd be caused by the fix directly, probably just introduced by possibility to do such things now. Investigating deeper.


apnadkarni added on 2024-05-16 14:17:31:

Sorry, forgot to include the output

make -s valgrind TESTFLAGS="-f ioCmd.test -match iocmd-32.3"
...
==52905== 6,728 (264 direct, 6,464 indirect) bytes in 1 blocks are definitely lost in loss record 115 of 115
==52905==    at 0x484280F: malloc (vg_replace_malloc.c:442)
==52905==    by 0x4894FA5: TclpAlloc (tclAlloc.c:699)
==52905==    by 0x48AE1C9: Tcl_Alloc (tclCkalloc.c:1059)
==52905==    by 0x49A2745: Tcl_CreateChannel (tclIO.c:1630)
==52905==    by 0x49B332E: TclChanCreateObjCmd (tclIORChan.c:665)
==52905==    by 0x48A0C55: Dispatch (tclBasic.c:4503)
==52905==    by 0x48A0CDB: TclNRRunCallbacks (tclBasic.c:4539)
==52905==    by 0x48A0594: Tcl_EvalObjv (tclBasic.c:4262)
==52905==    by 0x48A2A7D: TclEvalEx (tclBasic.c:5408)
==52905==    by 0x49BE690: Tcl_FSEvalFileEx (tclIOUtil.c:1821)
==52905==    by 0x49C99AB: Tcl_MainEx (tclMain.c:403)
==52905==    by 0x4012F4: main (tclAppInit.c:91)
==52905==
==52905== LEAK SUMMARY:
==52905==    definitely lost: 264 bytes in 1 blocks
==52905==    indirectly lost: 6,464 bytes in 131 blocks
==52905==      possibly lost: 0 bytes in 0 blocks
==52905==    still reachable: 0 bytes in 0 blocks
==52905==         suppressed: 0 bytes in 0 blocks
==52905==

apnadkarni added on 2024-05-16 13:58:32:
I see memory leak from valgrind. Have not reviewed the code yet, but my initial impression from the stack is that it is not from the fix but something that the fix has exposed in error handling.

jan.nijtmans added on 2024-05-16 12:54:02:

+1. Good work, Sergey!


dkf added on 2024-05-16 12:45:05:

Looks good to merge to me.


sebres added on 2024-05-16 09:17:05:

All CI flows have passed... the latest changes are irrelevant for that, so I cancelled the ci-tag for now and set ticket to pending/fixed (and assigned to Ashok).


apnadkarni added on 2024-05-15 15:15:14:
I'll take a look, for my own edification more than anything else, but in a day or two... too busy with other things at the moment.

jan.nijtmans added on 2024-05-15 14:35:17:

> Sebres, Jan, great work!

I'll pass that to Ashok, I didn't do anything ;-)

Anyway, I added a tag such that it will be built on GITHUB, on 3 platforms and various configurations. We'll see the result tomorrow.

Ashok, feel free to review. It looks good to me.


oehhar added on 2024-05-15 13:26:16:

Sebres, Jan, great work! maybe, it may be tested by the continuous integration. Jan may advise, how to do this. Jan modifies a certain file to test a certain branch and it is tested over night...

Take care, Harald


sebres added on 2024-05-15 13:08:50:

Here you go - in branch fix-79474c58800cdf94 you'd find a PR with a test and fix.

And you was right, solely protection by ref-counting of csPtr doesn't always work (additionally it needs preserving of referenced r/w channels as well as small change in DoRead exiting with an error if no bufPtr is there). Anyway it works as expected now (and I also did not see any leak there after implementation of ref-counter).

I'll merge it later to 8.6 if no objections follow.


apnadkarni added on 2024-05-14 16:14:29:

Ref counting csPtr was the very first thing that had come to my mind. Then I saw dgp had already attempted a fix on similar basis and then backed it out. See his comments in https://core.tcl-lang.org/tcl/info/32ae34e63a and the code modifications in https://github.com/tcltk/tcl/commit/1f6b04d51b1c1270bf18e1535567f54c952a3158

Perusing that code led me to realize the ref count based fixes, which were on very similar lines, were not as simple as I had initially imagined. That is where I concluded fixing for 9.0 would be a daunting task given the time frame.

It would be great if you have a solution.

Unfortunately, the issues that caused dgp to back out his fix do not seem to be listed anywhere.


sebres added on 2024-05-14 15:43:50:

Fix can be simpler: protect csPtr (StatePtr->csPtrR & StatePtr->csPtrW) by the refCount, so StopCopy would only free csPtr if its refCount gets 0.

I think I already have a branch having such protection. Looking...


apnadkarni added on 2024-04-29 06:10:33:

For the record, this bug is still present in Tcl 9.0b2.

Looking at the core dump, the root cause seems similar to that in 88aef05cd - reentrancy into TclInvokeMethod. A cursory inspection of the code shows the refchan read method in the test calling close which nests a call into finalize releasing various resources. As the stack unwinds at the top level TclInvokeMethod, freed memory is accessed resulting in the crash.

Fixing would require either

  1. Protection of all data structures to handle reentrant calls

  2. Forbid any recursive calls into refchans from the script level refchan as suggested by @aku in the above mentioned ticket as one of the options.

The former feels a daunting task for 9.0, though I'm not sure exactly how much work is required. The latter is likely simpler (though not trivial) but would mean backward incompatibility (implying it's either done for 9.0 or have to wait till 10.0), loss of function, and would require a TIP.


sebres added on 2019-07-31 19:59:35:

Same on windows (I updated the ticket) - segfaulting too.

Because it may be affected by buffer is still in use, [32ae34e63a] could be related issue (then it affects 8.5 in the same manner).