Tcl Source Code

View Ticket
Login
Bounty program for improvements to Tcl and certain Tcl packages.
Tcl 2019 Conference, Houston/TX, US, Nov 4-8
Send your abstracts to [email protected]
or submit via the online form by Sep 9.
Ticket UUID: 578155d5a19b348dc1a9fe96cc2c067a59326a89
Title: Very rare bug (segfault) if set variable (with error case) using self-releasable object as new value
Type: Bug Version: >= 8.5
Submitter: sebres Created on: 2017-07-12 19:15:40
Subsystem: 07. Variables Assigned To: nobody
Priority: 7 High Severity: Critical
Status: Open Last Modified: 2019-06-21 12:15:47
Resolution: None Closed By: nobody
    Closed on:
Description:

I found a very rare but very annoying bug (segfault), if variable will set for example using Tcl_ObjSetVar2 or similar.
In result it belongs to very old check-in [510663a99e3a096bb7bab7314eb59fc805335318] from 2005.
I had this bug sometimes very-very sporadically, so I executed one of my tcl-service under debugger until it not occurred again.

PoC:

  • somewhat will set var varName to newValue with `Tcl_ObjSetVar2`, `Tcl_SetVar2Ex` or similar (e. g. with flag TCL_LEAVE_ERR_MSG);
  • thereby this object (newValue) was only once referenced (somewhere in interpreter state, e. g. something in sub-list or sub-dictionary of interp-result, etc.). Emphasis on "only once", so newValue->refCount is 1.
  • the set produces an interim error (for example something going wrong by the resolving of the varname, or in trace, etc)
  • by the following throwing of the error-state to interp (result, errorInfo, errorCode) this will automatically decrease old object of interp-state, which can also remove all children
  • thus the newValue->refCount will be implicit decreased to 0, and object newValue will be released.
  • the problem is then the code like here - artifact/3293a2dbff528bd4?ln=1458 or artifact/3293a2dbff528bd4?ln=1517, because it tries to access already released object newValue (that does not exists anymore!) and decrease its reference again and then tries to release it again!

Why I think, that is a bug?

Because only the caller of `Tcl_ObjSetVar2` really know that the reference of this object should be decremented or not. And because explicit decreasing inside `Tcl_ObjSetVar2` is very unexpected behavior, IMHO (because rather increased or unmodified).
Otherwise the call of `Tcl_ObjSetVar2`, `Tcl_SetVar2Ex` or similar should always look like this:

Tcl_IncrRefCount(newValue);
Tcl_ObjSetVar2(interp, varName, NULL, newValue, ...)
Tcl_DecrRefCount(newValue);
But in this case the above-mentioned code is totally unnecessary (because newValuePtr->refCount will be never 0).
And then this check-in no longer makes sense at all.

So I'm desperate, how it can be fixed plausible resp. totally backwards compatible (this check-in is older as 10 years).

Suggestions are welcome...

User Comments: sebres added on 2019-06-21 12:15:47:

The next one (was recently fixed by Donal): [ca4a6f0a9595ac1f].
And I guess it is still not the last.


sebres added on 2018-01-16 12:19:12:
Append to my previous comment...

As reragds the new flag TCL_OWN_OBJREF - it should say the Tcl_ObjSetVar2, that the object 
should be owned to the variable.
So Tcl_ObjSetVar2 DOES NOT increase refCount in success case and should ALWAYS decrease the
refCount in error case. Emphasis on ALWAYS (and not just in case of no references after failed assignment).

So both codes from my previous comment will look like:

1. Transmit "ownership" of reference to variable (** don't touched in success, -1 in error case):
```
int X1 (Tcl_Obj *varName, Tcl_Obj *objPtr, ...) {
  if (Tcl_ObjSetVar2(interp, varName, NULL, objPtr, flags | TCL_OWN_OBJREF)) == NULL) {
    /* Handle error ... */
    return TCL_ERROR;
  }
}
```

2. New reference with variable (** +1 if success, don't touched in error case):
```
int X2 (Tcl_Obj *varName, Tcl_Obj *objPtr, ...) {
  if (Tcl_ObjSetVar2(interp, varName, NULL, objPtr, flags & ~TCL_OWN_OBJREF)) == NULL) {
    /* Handle error ... */
    return TCL_ERROR;
  }
}
```

That's it.

In this case all internal calls that submit new created object could be:
- either extended with `flags | TCL_OWN_OBJREF`
- or additionally the "obscure" hack with auto-decrement in the error case will remain, 
  but should be then definitely rewritten as suggested in branch fix-8-5-578155d5a19b348d]
  (see [bc5e7bdafc]).

The second way (auto-decrement in error case on refCount==0) is admittedly backwards-compatible,
but IMHO not good in the long term, also if it will be documented as precise as possible.

sebres added on 2018-01-16 09:46:10:
> I don't know of any counterexample, and you have not given me one.

But, I had.
So I'll try to explain my example with function X again.

Now the developer has several issues depending from initial situation:

1. If X1 "owns" the object reference (caller of X1 transferred object to X1), we should always
write the code like below:
```
int X1 (Tcl_Obj *varName, Tcl_Obj *objPtr, ...) {
  Tcl_Obj *varObj;
  int orgRefCount = objPtr->refCount;
  if ((varObj = Tcl_ObjSetVar2(interp, varName, NULL, objPtr, flags)) == NULL) {
    if (orgRefCount != 0) {
      /* BUG-X1.1 - can be wrong here, because the object can be destroyed in trace */
      Tcl_DecrRefCount(objPtr);
    }
    /* Handle error ... */
    return TCL_ERROR;
  }
  if (varObj == objPtr) {
    if (orgRefCount != 0) {
      Tcl_DecrRefCount(objPtr);
    }
  } else {
    /* BUG-X1.2 - can be dangerous here, because the object can be destroyed in trace */
    if (orgRefCount != 0 && objPtr->refCount > 1) {
      Tcl_DecrRefCount(objPtr);
    }
  }
}
```

2. If X2 don't "own" the object (caller of X2 still "holds" the ownership (also indirectly, 
e. g. as member of class), it means - just assign a variable), we should always write the 
code like below:

```
int X2 (Tcl_Obj *varName, Tcl_Obj *objPtr, ...) {

  if (sharingAllowed) {

    Tcl_IncrRefCount(objPtr);
    if (Tcl_ObjSetVar2(interp, varName, NULL, objPtr, flags) == NULL) {
      /* BUG-X2.1 - it can produce a leak, if objPtr->refCount was decremented to 1 during 
         some trace-execution, but if we'll do it always, we can unwanted destroy object */
      if (objPtr->refCount > 1) {
        Tcl_DecrRefCount(objPtr);
      }     
      /* Handle error ... */
      return TCL_ERROR;
    }
    /* BUG-X2.2 - it can produce a leak, if objPtr->refCount was decremented to 1 during 
       some trace-execution, but if we'll do it always, we can unwanted destroy object */
    if (objPtr->refCount > 1) {
       Tcl_DecrRefCount(objPtr);
    }
    /* Handle success case ... */
    return TCL_OK;

  } else {

    if (!objPtr->refCount) {
      /* BUG-X2.3 - Impossible, we should increment it before call of Tcl_ObjSetVar2, 
         but in this case the object can get shared state after assignment to var */
    }
    if (Tcl_ObjSetVar2(interp, varName, NULL, objPtr, flags) == NULL) {
      /* Handle error ... */
      /* BUG-X2.4 - the object can be destroyed inside Tcl_ObjSetVar2, so not allowed to use anymore 
         in caller of X2 */
      return TCL_ERROR;
    }

  }
}
```

dgp added on 2018-01-13 16:45:43:
> If there's some scenario that framework will not handle properly
> I'm genuinely interested in knowing what it is.  I need a demo.

Allow me to make that more precise.

If callers of Tcl_ObjSetVar2 follow the advice I spelled out before,
there will be no memory leak, and there will be no segfault due to
a double-free of the value.  I don't know of any counterexample,
and you have not given me one.  That's what I mean when I claim
that Tcl_ObjSetVar2() is bug-free when called properly. Note that
in most cases "called properly" means do nothing about the
refcount of value at all, which is very convenient.

Now it's becoming more clear to me that your objection is a bit
different. You not only seek to avoid memleaks and double-frees,
but also seek to avoid turning unshared values into shared values,
and you're looking for an improvement that can achieve that too.

I don't fully understand your TCL_OWN_OBJREF idea, and the
branch  fix-8-5-578155d5a19b348d seems to be out of date.
Is there a patch anywhere of the change you seek?

I think at a minimum, any change to Tcl_ObjSetVar2() has to
keep the following existing calling convention operating
successfully without leaks or crashes:

if (NULL = Tcl_ObjSetVar2(interp, arr, elem, Tcl_NewObj(), flags)) {
    /* Handle error */
}

Does your proposed patch preserve this 'fire and forget' mode of
operation?  Or does it impose the burden of needing to add the
TCL_OWN_OBJREF flag to flags in order to maintain that functionality?

sebres added on 2018-01-11 22:11:35:
> 1. That's false.  Look at the code.

And?... 

1.a. You are also a bit wrong here, because there are more places, where it can be 
    (also during set, see https://core.tcl.tk/tcl/artifact/1f16d4aba6?ln=1934-1936).
1.b. You'll say that TclObjLookupVarEx can't cause decrement of reference of some object? It can! 
    (Think about the traces on get, in-place changes of internal representation of varname, etc.)
1.c. The caller of Tcl_SetVar2Ex don't know at all why this function was failed, during lookup or during 
    some trace call through set... Or something else. So what should he do???

> 2.a. Here's the framework for a caller to cover all the cases.

This scenario is well known to me and may pretty good work for simply (especially for new-created objects) 
or for the objects for which you can sure increment reference before you assign it.

But it does not good for foreign objects, which you don't "owns" resp. for which you must not 
call Tcl_IncrRefCount, for example because you must not make it shared in this case (should be 
definitely "unshared" by assignment) or because you can not be sure that assignment fails, so 
your decrement of the refCount hereafter will guaranteed destroy this foreign object.

You forget that the variable is not necessary a simple var, it can be "upvared" to 
namespace (class/object) member, linked via some "complex" resolver to some scope- or 
scope-less frame, etc.

> 2.b. Please show me the failure.

Ok, here you go: just imagine, that you are developer of function X (e. g. with public API) that looks like:

int X (Tcl_Obj *varName, Tcl_Obj *objPtr, ...);

You don't know what the object "objPtr" is (it is not yours, you'll get it from foreign code and from other people),
so it can have refCount == 0, 1 and larger as 1.
How you can set it to the variable varName, inside this function and what you'll do in error case.
If you'll increment it, possibly you can (unwanted) destroy it in error case by the paired decrement.
If you don't increment it, you can not be sure what do you want to do in the error case.
Also extra sharing of the assignment-object may be sometimes undesirable in many-many cases. At least can cause many
unwanted interim duplication's (just think about, that by assignment, the trace on set can be better executed with
unshared object, etc.). 
But in worse case may even cause errors like "...something... called with shared object" or similar).

Of course, I can also imagine some scenario's (like yours framework) for all this cases, BUT then this essentially
trivial task (just "set a var") will grow in no more trivial use-case... (and more code - more errors, at least by
people that are not very familiar with Tcl-API, so you'll really force them always to write all this around it?).

The question is WHY it should be?! So why it all should be necessary just to assign an object to the variable?

This (in my opinion controversial) behavior can be easy "fixed":

- either using special flag (like suggested below) - TCL_OWN_OBJREF, which will always decrease the refCount in error
  case (so the caller can control it resp. does the decision what should happen with him);
- or using pre-checking of refCount inside set-functions (before variable-lookup resp. assignment takes place, and
  before the object gets no more sane state in undefined error case).
- or even both to be totally backwards-compatible in all cases and to provide for developers the opportunity to decide,
  what should happen.

Meanwhile I've implemented both things for my own tcl-fork, and it works fine and saves me 
many time, work and code-lines (not to mention the readability).

dgp added on 2018-01-11 20:24:10:
>   1) The function checks the refCount only AFTER trying of set the variable
> (which can change the refCount of the object by calling of the traces, etc).

That's false.  Look at the code.

https://core.tcl.tk/tcl/artifact/1f16d4aba6?ln=1718-1728

  1718      varPtr = TclObjLookupVarEx(interp, part1Ptr, part2Ptr, flags, "set",
  1719  	    /*createPart1*/ 1, /*createPart2*/ 1, &arrayPtr);
  1720      if (varPtr == NULL) {
  1721  	if (newValuePtr->refCount == 0) {
  1722  	    Tcl_DecrRefCount(newValuePtr);
  1723  	}
  1724  	return NULL;
  1725      }
  1726  
  1727      return TclPtrSetVarIdx(interp, varPtr, arrayPtr, part1Ptr, part2Ptr,
  1728  	    newValuePtr, flags, -1);

The Decrement comes only after a failed lookup attempt, which calls no traces.
Traces are called from within TclPtrSetVarIdx(). The lookup routine also
doesn't take newValuePtr as an argument. The only "callback" opportunities
I see in the lookup routines is the invocation of variable resolvers.

> 2) For the developer using Tcl_ObjSetVar, it is very bad up to impossible
> to act plausible in error case, because he don't know when he should decrease
> refCount in error case or not, 

Here's the framework for a caller to cover all the cases.
If the caller wants to prevent any free of "value" by Tcl_ObjSetVar2(),
it can call Tcl_IncrRefCount(value) before calling Tcl_ObjSetVar2()
and call Tcl_DecrRefCount(value) after it no longer will use "value".
This will manage the lifetime without leaks.  If the caller has no need
to use "value" other than passing it to Tcl_ObjSetVar2(), then the code
below does it.

Tcl_Obj *value; /* Either create or receive as argument, any refcount. */
Tcl_Obj *stored;
Tcl_Obj *varName = Tcl_NewStringObj("foo", -1);
Tcl_IncrRefCount(varName);
stored = Tcl_ObjSetVar2(interp, varName, NULL, value, TCL_LEAVE_ERR_MSG);
Tcl_DecrRefCount(varName);
if (stored == NULL) {
    /* If value had refcount 0, it has been freed. */
    /* Otherwise, it still has the refcount it was called with. */
} else if (stored == value) {
    /* The variable "foo" holds value */
    /* The refcount of value one greater than it was called with. */
} else {
    /* The variable "foo" holds stored, which is different from value. */
    /* Traces intervened. */
    /* If value had refcount 0, it is freed. */
    /* Otherwise, it still has the refcount it was called with. */
}

This approach is very convenient for callers. In most cases they
need not deal with refcounts of "value" at all.

If there's some scenario that framework will not handle properly
I'm genuinely interested in knowing what it is.  I need a demo.
Please show me the failure.

sebres added on 2018-01-11 16:46:03:
> If you find code like this anywhere:
> if (Tcl_ObjSetVar2(interp, varObj, NULL, newObj, 0) == NULL) {Tcl_DecrRefCount(newObj);}
> ...
> But that's the only bug I see.

Well, this would be so, if the Tcl_ObjSetVar2 will get "ownership" for the object always, also in the error case.

But, the problem is very, very serious, Don.

Currently, Tcl_ObjSetVar "owns" object only in success-case, AND in error case, BUT then only if refCount is 0! AFTER the "failed" attempt of assignment, Don!

So I see at least 2 issues here:
  1) The function checks the refCount only AFTER trying of set the variable (which can change the refCount of the object by calling of the traces, etc). Much worse it looks just like undefined behavior.
  2) For the developer using Tcl_ObjSetVar, it is very bad up to impossible to act plausible in error case, because he don't know when he should decrease refCount in error case or not, especially because the object can be released within Tcl_ObjSetVar, and then the developer has normally no chance to check the refCount (resp. the vitality of the object hereafter is unknown in error case). 

Thus either he do this nevertheless and can get segfault, or he produces intentionally a memory leak (quasi accepts the risk of possible leak).

Additionally it is very confusing behavior for the function from public API in my understanding... Also if this will be "correctly" documented.

In my opinion it is wrong so it is implemented.

dgp added on 2018-01-11 15:56:32:
If you haven't already seen it, the history of the
purpose of the interface change in Tcl_ObjSetVar2()
between 8.4 and 8.5 is here

https://core.tcl.tk/tcl/tktview/1334947

It was not until much later that the corresponding change
to Tcl_SetVar2Ex() was completed properly.

https://core.tcl.tk/tcl/tktview/7368d225a6

In 8.4 it was impossible to call the variable setting
routines in a way that could fully reliably prevent
memory leaks. Error paths from failed lookup were
inconsistent with error paths from traces.
It was an unusable interface.

For 8.5, the refcount behavior of the variable setting
routines was changed to what it is now.  These routines
now act as Consumers of the "newValuePtr" argument.
Effectively they always incr the refCount, and promise
to eventually decr it.  If you pass in a 0 ref count newly
created value, this scheme will not leak it. That's the
most convenient way to use these routines.  It is possible
the newValuePtr will be freed before the variable setting
routine returns.  If the caller has a need to use that value
after the variable setting routine returns, it needs to
explicitly make its own incr of the refcount to preserve
the value and its own balancing decr.

If you find code like this anywhere:

Tcl_Obj *newObj = Tcl_NewObj();
if (Tcl_ObjSetVar2(interp, varObj, NULL, newObj, 0) == NULL) {
  Tcl_DecrRefCount(newObj);
  return TCL_ERROR;
}

It is an outdated attempt to code to the 8.4 interface.
It must be changed to work properly with Tcl 8.5 and later.
See for example,

http://core.tcl.tk/tk/tktview?name=3607326

I agree the documentation of this is not good enough. But
that's the only bug I see.

If I'm missing some scenario where that's not addressing something
important, I'm afraid I'm going to need a concrete demonstration
(not a description. Something I can run) to see it.  Best I can tell
only a misbehaving variable resolver can cause trouble, and in that
case the remedy would be to make that resolver behave.

sebres added on 2017-12-01 21:16:24:
/ping/ Any thoughts about this?...

sebres added on 2017-07-17 16:59:39:

Well, this seems to have repercussions - today I would check my idea with new flag TCL_OWN_OBJREF and have verified at which places everywhere in tcl (and some modules like thread, etc) it may be needed. Thereby I found many places, where it's currently wrong (e. g. usage of released object, wrong free or even leaks).
Too many to list all this here...
Just as an example, see Tcl_ObjSetVar2(..., matchVarObj, NULL, emptyObj, ...) that will use already released object emptyObj if 10 lines above the same object emptyObj will be released in trace by Tcl_ObjSetVar2(..., indexVarObj, NULL, emptyObj, ...) . Note that in current versions this does not have Tcl_DecrRefCount(emptyObj) in error cases (since auto-release in [510663a99e3a096bb7bab7314eb59fc805335318]), but it does no matter because this can be released in trace by set.

I would like to fix all such errors (and similar) for 8.5th, 8.6th and trunk branches (together with introducing of already suggested new flag TCL_OWN_OBJREF or using some other solution like new internal function TclObjOwnAndSetVar), but firstly I would like to know what TCT thinks about (new flag?, new function?, something other?). IMHO but (very-very controversial) auto-release made in [510663a99e3a096bb7bab7314eb59fc805335318] is not really a solution and should be rewritten.

Please note also, that this behavior is undocumented, so many people make still:

Tcl_Obj *newObj = SomethingReturnsNewObjOfTypeX(...);
if (Tcl_ObjSetVar2(..., varObj, NULL, newObj, ...) == NULL) {
  Tcl_DecrRefCount(newObj);
  return TCL_ERROR;
}
What is currently wrong (because since [510663a99e3a096bb7bab7314eb59fc805335318] it is double decreased, and can cause segfault).


sebres added on 2017-07-14 14:26:09:

the first attempt (8.5 based) as minimalist fix made in fix-8-5-578155d5a19b348d.


sebres added on 2017-07-14 10:20:42:

Still one argument against this code at all - if caller want to release ownership of the object (should not hold own reference anymore after set variable, the caller should currently take care by the decreasing of it.
Especially in the error case of Tcl_ObjSetVar2 (because this time it should then always do it). How it can be sure, that Tcl_ObjSetVar2 did it not already? So for example:

/* we get someValue from somewhere and take ownership of reference now */
someValue = TransmitObjFromSomewhere(...);
...
if (!Tcl_ObjSetVar2(interp, varName, NULL, someValue, ...)) {
  /* [**BAD**] error case - decrease and return */
  Tcl_DecrRefCount(someValue);
  /* [**/BAD**] */
  return ...;
}
/* do somewhat else also, e. g. using this object, for example: */
... Tcl_GetString(someValue) ...

/* we are ready here, release our ownership (further in var) */ Tcl_DecrRefCount(someValue); return ...;

But currently we cannot do this sane...
I mean the code enclosed between [**BAD**]...[**/BAD**].
Because we can be not sure what happens with reference in Tcl_ObjSetVar2.

Possibly we should introduce a new flag (TCL_OWN_OBJREF) like here:

// tcl.h:

+ /* Indicate the object reference supplied to Tcl_ObjSetVar2 etc should be not owned by caller anymore */ + #define TCL_OWN_OBJREF 0x800000

// everywhere where expected to transmit the reference ownership to var:

- Tcl_ObjSetVar2(interp, varName, NULL, someValue, TCL_LEAVE_ERR_MSG); - Tcl_DecrRefCount(someValue); + Tcl_ObjSetVar2(interp, varName, NULL, someValue, TCL_OWN_OBJREF | TCL_LEAVE_ERR_MSG);

It could then set the variable value with decreasing of the refCount by variable assignment routines (but always, not only in error case if refCount was 0).

IMHO only this would be the right way, to make it sane.


sebres added on 2017-07-12 20:04:15:

Although the emphasis was "only once" referenced, but it is not always the case: the object newValue can be referenced twice or more (newValue->refCount > 1) and described situation can nevertheless still occur, if we've something like this:

set X [list [list 1st-list $newValue] [list 2nd-list $newValue] ...]
call-something-that-set-var ::missing::namespace::var [call-something-unset-X-and-return-its-value X]
I think this example describes the issue very good.

Thus my proposal to solve it (resp. to workaround it) using:


+     int freeNewVal = (newValuePtr->refCount == 0);
part1 = TclGetString(part1Ptr); part2 = ((part2Ptr == NULL) ? NULL : TclGetString(part2Ptr));
varPtr = TclObjLookupVar(interp, part1Ptr, part2, flags, "set", /*createPart1*/ 1, /*createPart2*/ 1, &arrayPtr); if (varPtr == NULL) { - if (newValuePtr->refCount == 0) { + if (freeNewVal) { Tcl_DecrRefCount(newValuePtr); } return NULL; }