Ticket UUID: | 79474c58800cdf94559a2c01230b4536aa416016 | |||
Title: | refchan segfault | |||
Type: | Bug | Version: | 8.6.9 | |
Submitter: | mjanssen | Created on: | 2019-07-31 13:22:05 | |
Subsystem: | 25. Channel System | Assigned To: | apnadkarni | |
Priority: | 5 Medium | Severity: | Important | |
Status: | Closed | Last Modified: | 2024-05-21 10:35:31 | |
Resolution: | Fixed | Closed By: | sebres | |
Closed on: | 2024-05-21 10:35:31 | |||
Description: |
The following script causes a segfault in DoRead: proc nexec {cmd in out stderr} { lassign [chan pipe] in1 out1 lassign [chan pipe] in2 out2 lassign [chan pipe] in3 out3 puts [exec {*}$cmd >@ $out2 2>@ $out3 <@ $in1 &] fileevent $in2 readable [list chan copy $in2 $out] fileevent $in3 readable [list chan copy $in3 $stderr] fileevent $out1 writable [list chan copy $in $out1] } proc test {args} { puts $args set rest [lassign $args mode chan] switch -exact $mode { read {puts $chan "Test" ; close $chan} initialize {return "initialize watch finalize read write"} finalize {} } } set c [chan create "read write" test] if {$::tcl_platform(platform) ne "windows"} { nexec {sh -c "echo test"} $c $c $c } else { nexec {cmd /c "echo test"} $c $c $c } vwait forever Gives with a debug build: % source exec-test.tcl initialize rc0 {read write} 36514 read rc0 4096 write rc0 {Test } finalize rc0 Assertion failed: (bufPtr != NULL), function DoRead, file /Users/mpcjanssen/Src/tcltk/tcl/generic/tclIO.c, line 9815. | |||
User Comments: |
sebres added on 2024-05-21 10:35:31:
merged, so closed now. jan.nijtmans added on 2024-05-21 10:22:20: You'll notice that core-8-branch is currently in a non-stable state, that's why the latest change is not merged yet to trunk. This creates a problem for you, when trying to merge your change to trunk: it will introduce the same problem in trunk. Therefore, I suggest to "cherry-pick" your fix to trunk. Hopefully dkf will solve his problem soon, but that shouldn't stop you. sebres added on 2024-05-21 09:03:16: All tests have passed too... So I'll merge it to mainline branches. apnadkarni added on 2024-05-19 07:54:42: Sergey, reviewed and tested on linux with valgrind, windows with iocp and twapi tls. Looks good to me to be merged. sebres added on 2024-05-17 13:08:22: I integrated it in [f41c1304d5f124fe]... and added more clean-ups for other members that look safe and, I can imagine, they may be set after the close, asynchronously via some callbacks or handlers (e. g. synthetic timer scheduled by UpdateInterest, etc). Thanks for co-working, Ashok. @everyone, I'll leave it as it is for CI (added a tag to test)... apnadkarni added on 2024-05-17 12:41:58: Will leave it to you then. My fix is in apn-channelstate-leak. I preferred not to touch other fields for the reasons I mentioned in the comment. If you are confident they are appropriately null-ed when freed, then yes, it would be good to add them. Thanks for fixing this vexing crash. /Ashok sebres added on 2024-05-17 12:14:22: Yes I already found it too... Checking now whether other members of statePtrr (besides chanMsg) can be affected in similar way. apnadkarni added on 2024-05-17 11:29:10: Found the new reported leak. I think. As expected, not caused by your fixes. ChannelState gets freed via Tcl_EventuallyFreed(.., TCL_DYNAMIC) but that will not free ChannelState.{chanMsg,unreportedMsg} Only ran into this because of the error you mentioned in the write handler in the new test case. Testing now. sebres added on 2024-05-17 10:22:42:
Yep, and trying to find this one leak, I found that a write handler in test reflected command was incorrect - this didn't return an integer, thus produced an after effect error However fixing the test induced further segfault (and 2nd assert(bufPtr) throws then too). Commit [2c48f0c544ad9ddd] fixes this segfault now. I don't know whether it'd circumvent the leak (it was But I also think that this leak has nothing to do with the problem or the fix. apnadkarni added on 2024-05-17 07:59:10: Not reviewed yet but there is still a leak - looks like a different one though, and again not related to your changes but exposed by them, I think. Only the tail of the log is below as it is quite large. Let me know if it would help to have the full log.
sebres added on 2024-05-16 20:32:31: Commit [47cb98a686c4276b] should fix the leak. Tests and reviews are welcome (Ashok, could you please repeat your valgrind test again). sebres added on 2024-05-16 19:19:10: OK, it looks like a The issue is therefore not related to this ticket and/or fix...
I'll try to solve that somehow as part of this fix. sebres added on 2024-05-16 18:24:33: It looks like 1 object (03A09940) containing Here is a debug excerpt explaining that (3x preserve / 2x release of
Maybe it is something like ScriptRecord (script of chan event as list containing sebres added on 2024-05-16 15:18:39: Hmm, indeed... I see 3 times I'm also unsure it'd be caused by the fix directly, probably just introduced by possibility to do such things now. Investigating deeper. apnadkarni added on 2024-05-16 14:17:31: Sorry, forgot to include the output
apnadkarni added on 2024-05-16 13:58:32: I see memory leak from valgrind. Have not reviewed the code yet, but my initial impression from the stack is that it is not from the fix but something that the fix has exposed in error handling. jan.nijtmans added on 2024-05-16 12:54:02: +1. Good work, Sergey! dkf added on 2024-05-16 12:45:05: Looks good to merge to me. sebres added on 2024-05-16 09:17:05: All CI flows have passed... the latest changes are irrelevant for that, so I cancelled the ci-tag for now and set ticket to pending/fixed (and assigned to Ashok). apnadkarni added on 2024-05-15 15:15:14: I'll take a look, for my own edification more than anything else, but in a day or two... too busy with other things at the moment. jan.nijtmans added on 2024-05-15 14:35:17: > Sebres, Jan, great work! I'll pass that to Ashok, I didn't do anything ;-) Anyway, I added a tag such that it will be built on GITHUB, on 3 platforms and various configurations. We'll see the result tomorrow. Ashok, feel free to review. It looks good to me. oehhar added on 2024-05-15 13:26:16: Sebres, Jan, great work! maybe, it may be tested by the continuous integration. Jan may advise, how to do this. Jan modifies a certain file to test a certain branch and it is tested over night... Take care, Harald sebres added on 2024-05-15 13:08:50: Here you go - in branch fix-79474c58800cdf94 you'd find a PR with a test and fix. And you was right, solely protection by ref-counting of csPtr doesn't always work (additionally it needs preserving of referenced r/w channels as well as small change in DoRead exiting with an error if no bufPtr is there). Anyway it works as expected now (and I also did not see any leak there after implementation of ref-counter). I'll merge it later to 8.6 if no objections follow. apnadkarni added on 2024-05-14 16:14:29: Ref counting csPtr was the very first thing that had come to my mind. Then I saw dgp had already attempted a fix on similar basis and then backed it out. See his comments in https://core.tcl-lang.org/tcl/info/32ae34e63a and the code modifications in https://github.com/tcltk/tcl/commit/1f6b04d51b1c1270bf18e1535567f54c952a3158 Perusing that code led me to realize the ref count based fixes, which were on very similar lines, were not as simple as I had initially imagined. That is where I concluded fixing for 9.0 would be a daunting task given the time frame. It would be great if you have a solution. Unfortunately, the issues that caused dgp to back out his fix do not seem to be listed anywhere. sebres added on 2024-05-14 15:43:50: Fix can be simpler: protect csPtr (StatePtr->csPtrR & StatePtr->csPtrW) by the refCount, so StopCopy would only free csPtr if its refCount gets 0. I think I already have a branch having such protection. Looking... apnadkarni added on 2024-04-29 06:10:33: For the record, this bug is still present in Tcl 9.0b2. Looking at the core dump, the root cause seems similar to that in 88aef05cd - reentrancy into TclInvokeMethod. A cursory inspection of the code shows the refchan read method in the test calling close which nests a call into finalize releasing various resources. As the stack unwinds at the top level TclInvokeMethod, freed memory is accessed resulting in the crash. Fixing would require either
The former feels a daunting task for 9.0, though I'm not sure exactly how much work is required. The latter is likely simpler (though not trivial) but would mean backward incompatibility (implying it's either done for 9.0 or have to wait till 10.0), loss of function, and would require a TIP. sebres added on 2019-07-31 19:59:35: Same on windows (I updated the ticket) - segfaulting too. Because it may be affected by buffer is still in use, [32ae34e63a] could be related issue (then it affects 8.5 in the same manner). |
