Tcl Source Code

View Ticket
Bounty program for improvements to Tcl and certain Tcl packages.
Tcl 2019 Conference, Houston/TX, US, Nov 4-8
Send your abstracts to [email protected]
or submit via the online form by Sep 9.
Ticket UUID: 4955f5d8a4dce006e063f924bc9709f7850ecb6a
Title: tailcall splicing spot vs. raw TclInvokeObjectCommand
Type: Bug Version: 8.6
Submitter: ferrieux Created on: 2014-05-12 22:28:59
Subsystem: 60. NRE and coroutines Assigned To: msofer
Priority: 8 Severity: Severe
Status: Open Last Modified: 2014-05-20 17:58:44
Resolution: None Closed By: nobody
    Closed on:
When called at level 1 in TclInvokeObjectCommand, tailcall panics:

  tailcall cannot find the right splicing spot: should not happen!

By level 1 I mean: in a proc directly called by TclInvokeObjectCommand:

  proc f {} {tailcall g}
  proc g {} {puts GGG}

  /* in C */ Tcl_GetCommandInfo(interp,"f",&info);
  /* in C */ (*info.proc)(info.clientData,interp...);

At levels 2 and above, everything is fine:

  proc h {} f

  /* in C */ Tcl_GetCommandInfo(interp,"h",&info);
  /* in C */ (*info.proc)(info.clientData,interp...);

Note that this needle was in the haystack of a Tk embedding scenario: "lablTk", which is a prominent Tk binding for OCaml.

It turns out that in Unix, the color picker and other similar nonnative modal dialogs have a level-1 tailcall:

  # in tk.tcl
  if {![llength [info commands tk_chooseColor]]} {
      proc ::tk_chooseColor {args} {
  	tailcall ::tk::dialog::color:: {*}$args
User Comments: msofer added on 2014-05-20 17:58:44:
dkf's suggestion is correct, IMHO. It is also what I am trying to implement, buthaven't yet found a way and/or the time to do it properly.

Note that this seems to be an issue only for commands created via [proc] and friends, as [tailcall] only tries to do its thing when invoked from within a proc body. Given that tcl's core only uses the nreProc pointer for procs, the way to go may well be to concoct an objProc that knows how to handle [tailcall] on its own - that is, without relying on the nre machinery.

dkf added on 2014-05-20 17:32:38:

Since we can detect the situation with tailcall, we can (informatively) prevent it from crashing. Failing that, we can at least make it cleanly Tcl_Panic with a description that makes sense to downstream users, rather than saying something which only Miguel understands, but doing this in Tcl errors would indeed be best. These are the steps that should be taken prior to doing anything to actually make the OCaml work…

But I want to continue to explore other options to changing those procedures because forcing Tk to Do The Wrong Thing™ because of third party API abuse really strikes me as a massively retrograde step.

One possible fix would be to make procedures install a string command implementation (which Tcl wouldn't normally use) that does “the right thing”, delegating through Tcl_EvalObjv if there's no context. That wouldn't interfere with anything, since we never promised at all that procedures would not do such a thing (nor would anyone poking their fingers in really be thinking to look for such a trick).

We still need to fix things. It's just that right now, we can't say that the OCaml code won't try to use the trick with calling procedures they've created themselves that contain tailcall; we cannot possibly prevent them from doing that.

ferrieux added on 2014-05-19 17:54:23:
@Miguel: fully agreed, the core of the issue is an API that should not have been exposed in the first place. Making it baby-proof is not an option.

So I suggest to:

 - not touch T_GCI, and deprecate it in a future version
 - tell Brent to publish a revision without the anti-idiom
 - when the issue occurs, let tailcall err instead of panicking (if possible.. your call)
 - use the patch below and associated test for inclusion in the test suite
 - protect Tk as Jan suggested, since Tk is not a member of the Tcl test suite after all, and we want to minimize the risk of other bindings encountering the same problem (while applying a recipe from an authoritative book)

ferrieux added on 2014-05-19 17:48:03:
As requested, a patch to tclTest that demonstrates the panic in script.
The trick is, as Miguel predicted, to kickstart a slave with the T_GCI and proc pointer call.

  interp create foo
  interp eval foo {
    proc f {} {tailcall g}
    proc g {} {puts ggg}
  testcmdinfo getandcall foo f

You might say "ad hoc". It is :}
If however you consider it a non-joking test, I'll make it so.

Index: generic/tclTest.c
--- generic/tclTest.c
+++ generic/tclTest.c
@@ -1030,11 +1030,11 @@
     int argc,			/* Number of arguments. */
     const char **argv)		/* Argument strings. */
     Tcl_CmdInfo info;
-    if (argc != 3) {
+    if (argc < 3) {
 	Tcl_AppendResult(interp, "wrong # args: should be \"", argv[0],
 		" option cmdName\"", NULL);
 	return TCL_ERROR;
     if (strcmp(argv[1], "create") == 0) {
@@ -1071,10 +1071,32 @@
 	if (info.isNativeObjectProc) {
 	    Tcl_AppendResult(interp, " nativeObjectProc", NULL);
 	} else {
 	    Tcl_AppendResult(interp, " stringProc", NULL);
+    } else if (strcmp(argv[1], "getandcall") == 0) {
+        Tcl_Interp *slave;
+        if (argc<4) {
+            Tcl_AppendResult(interp, "wrong # args: should be \"", argv[0], argv[1],
+		" interp cmd [args]\"", NULL);
+	    return TCL_ERROR;
+        }
+        slave = Tcl_GetSlave(interp, argv[2]);
+        if (slave == NULL) {
+            return TCL_ERROR;
+        }
+	if (Tcl_GetCommandInfo(slave, argv[3], &info) ==0) {
+	    Tcl_SetResult(interp, "??", TCL_STATIC);
+	    return TCL_OK;
+	}
+	if (info.proc == NULL) {
+            Tcl_SetResult(interp, "NULLPROC", TCL_STATIC);
+	    return TCL_OK;
+	}
+        (*info.proc)(info.clientData,slave,argc-3,argv+3);
+        return TCL_OK;
     } else if (strcmp(argv[1], "modify") == 0) {
 	info.proc = CmdProc2;
 	info.clientData = (ClientData) "new_command_data";
 	info.objProc = NULL;
 	info.objClientData = NULL;

msofer added on 2014-05-19 16:53:57:
There is a fundamental problem in here: if we "fix" things so that the pointer returned by Tcl_GetCommandInfo is always directly callable, then it cannot be the same pointer as set at command creation or using Tcl_SetCommandInfo. One way or the other, some fundamental expectation of api users will be violated.

dkf added on 2014-05-19 16:16:46:

Jan, that's a bandage over the problem, not a fix. I want to see this fixed, even if the fix still issues an error in this case (instead of crashing). But I really want to see a test; we can't guarantee that we won't have this sort of problem again in the future without a test. It might be that the right approach would be to write a different way of doing the invoke that detects that things are without a Tcl_Eval* frame and inserts one in that case. We are, after all, talking about a case that won't ever be called from normal Tcl code.

Only once we've stopped the crash can we think in terms of making things actually work for the OCaml people.

jan.nijtmans added on 2014-05-19 13:46:52:

Possible solution here: 3322b5d9be

Since tailcall is causing the problem here, and a real solution might otherwise delay a Tk 8.6.2, I wonder if the cleaner stack is really worth the trouble. Tk 8.5 did it this way, and no-one ever complained (as far as I am aware).

ferrieux added on 2014-05-18 09:11:23:
I suspect that the Ocaml guys chose *that* way of calling into Tcl because, among all the Tcl_Eval* variants, that's the one you can call with a char **argv instead of a single string. Since they start with a list on their side, it makes senses to avoid the extra work of Tcl-stringification of the list (which is not itself an obvious offer of the Tcl API to non-Tcl_Obj-acquainted clients).

Preventing such (ab)use is tricky. Maybe demote Tcl_GetCommandInfo from the public stubs ?

dkf added on 2014-05-18 08:43:19:

They're calling the implementation directly?   O_o   Massively excessive “smartness” detected.

Is there some way to make a test in the Tcl test suite? (And yes, we should at least not crash in this case, but I would really want to have a test first.)

ferrieux added on 2014-05-17 18:17:22:
Starting to see the light.

Single-stepping in TclSetTailCall in the offending context, shows that upon entry, TOPCB=NULL. Though I'm far from familiar enough with the NRE to even start appreciating how wrong that is... I can use a brute-force grep, which shows that the only spots where TOPCB is set are TclNRAddCallback and TclNRRunCallbacks.

So it seems that LablTk's capital offense was to somehow call Tcl's evaluator *without* the NRE (which should be expected from code predating the NRE).

In hindsight, it also seems normal for [tailcall] to fail when no trampoline is below. Now what ?

 (1) is there a vanilla-tclsh way of reproducing this situation of a "non-NRE-eval" of a proc with a tail call ?

 (2) instead of panicking, what about raising an error ?

 (3) knowing this, those luxurious tailcalls in Tk should be removed, unless there's a stronger hidden reason than "cleaner error stacks".

Reactions ?

ferrieux added on 2014-05-17 09:24:31:
Updated title and description to reflect the fact that Tcl_Eval is not involved at all.

ferrieux added on 2014-05-16 22:19:47:
OK, I've dug. Compiling both lablTk and Tcl in -g sheds a bit more light:

(gdb) where
#10 0x00007fa2b074b58f in TclInvokeObjectCommand (clientData=0x265fea0, interp=0x25ce440, argc=5, argv=0x2625ad0) at /home/alex/src/fos/tcl/generic/tclBasic.c:2466
#11 0x00007fa2b0b31398 in camltk_tcl_direct_eval (v=140336735435760) at cltkEval.c:210

And in fact the lablTk caller does *not* use string-based Tcl_Eval, but TclInvokeObjectCommand, as info.proc returned by Tcl_GetCommandInfo:

  if (Tcl_GetCommandInfo(cltclinterp,argv[0],&info)) { 
    if (info.proc == NULL) {
       .. (use Tcl_Eval)
     } else {
      result = (*info.proc)(info.clientData,cltclinterp,size,(const char**)argv);

So maybe to reproduce the issue we should find a way to do the same, as "barely" as possible, from tcltest. Any idea ?

ferrieux added on 2014-05-13 17:20:20:
Yes, by "level 1" in simply mean "within the body of the first Tcl proc called by the C function Tcl_Eval".

Thanks for pointing out that Tcl_EvalEx is not so different from Tcl_Eval.

From your test, I deduce that maybe there's something to the extra initializations that come with calling tclevalex itself (from script) vs. directly calling the C function out of the blue ...

Alternatively, maybe the testing context itself matters, and lablTk fails to properly intialize the interp in the first place. Would need to dig in that code, grrr...

dgp added on 2014-05-13 12:04:08:
Sadly "level" is overloaded with meaning.

$ make runtest
% proc f {} {tailcall g}
% proc g {} {puts GGG}
% testevalex f

Works for me.

I'm guessing that script isn't really at what you
mean by "level 1" to make the demonstration?