Tcl Source Code

View Ticket
Login
Ticket UUID: 578155d5a19b348dc1a9fe96cc2c067a59326a89
Title: Very rare bug (segfault) if set variable (with error case) using self-releasable object as new value
Type: Bug Version: >= 8.5
Submitter: sebres Created on: 2017-07-12 19:15:40
Subsystem: 07. Variables Assigned To: pooryorick
Priority: 1 Zero Severity: Minor
Status: Closed Last Modified: 2023-03-31 15:27:04
Resolution: Invalid Closed By: pooryorick
    Closed on: 2023-03-31 15:27:04
Description: (text/x-fossil-wiki)
I found a very rare but very annoying bug (segfault) that occurs if a variable is set using, for example, Tcl_ObjSetVar2 or similar.<br/>
The behaviour stems from the very old check-in [510663a99e3a096bb7bab7314eb59fc805335318] back in 2005.<br/>
I had this bug sometimes very, very sporadically, so I executed one of my Tcl services under a debugger until it no-longer occurred.

<h2>Explanation:</h2>
<ul>
<li> <code>Tcl_ObjSetVar2</code>, <code>Tcl_SetVar2Ex</code>, or similar (e. g. with flag <code>TCL_LEAVE_ERR_MSG</code>), is called to set <code>varName</code> to <code>newValue</code>.</li>
<li> <code>newValue</code> has a reference count of <code>1</code> since it is referenced somewhere in interpreter state, e.g. a sub-list, sub-dictionary, interp-result, etc. Emphasis on "only once", so <code>newValue->refCount</code> is <code>1</code>.</li>
<li> The attempt to set the variable to <code>newValue</code> produces an interim error (for example something going wrong by the resolving of <code>varname</code>, or in trace, etc)</li>
<li> The error causes some interp state such as <code>result</code>, <code>errorInfo</code>, <code>errorCode</code>, to be modified, which decrements the reference count of newValue to <code>0</code> and it is deleted, since it happened to be stored in one of those locations.</li>
<li> The problem is then the code like here - <a href="http://core.tcl.tk/tcl/artifact/3293a2dbff528bd4?ln=1458">artifact/3293a2dbff528bd4?ln=1458</a> or <a href="http://core.tcl.tk/tcl/artifact/3293a2dbff528bd4?ln=1517">artifact/3293a2dbff528bd4?ln=1517</a>, because it tries to access the already-released newValue.  In some cases the code decrements the reference count again, resulting in another attempt to release it.</li>
</ul>

Why I think, that is a bug?

Because only the caller of <code>Tcl_ObjSetVar2</code> really know that the reference of this object should be decremented or not. And because explicit decreasing inside <code>Tcl_ObjSetVar2</code> is very unexpected behavior, IMHO (because rather increased or unmodified).<br/>
Otherwise the call of <code>Tcl_ObjSetVar2</code>, <code>Tcl_SetVar2Ex</code> or similar should <b>always</b> look like this:
<code><pre>
Tcl_IncrRefCount(newValue);
Tcl_ObjSetVar2(interp, varName, NULL, newValue, ...)
Tcl_DecrRefCount(newValue);
</pre></code>
But in this case the <a href="http://core.tcl.tk/tcl/artifact/3293a2dbff528bd4?ln=1517">above-mentioned code</a> is totally unnecessary (because newValuePtr->refCount will be never 0).<br/>
And then this check-in no longer makes sense at all.

So I'm desperate, how it can be fixed plausible resp. totally backwards compatible (this check-in is older as 10 years). 

Suggestions are welcome...
User Comments: pooryorick added on 2023-03-31 15:27:04: (text/x-fossil-wiki)
"nominal condition of every function in error case (here value is not assigned to variable, dict or whatever) is don't touch the value,"

<blockquote>
No, a function that receives a Tcl_Obj may increment the reference count and then decrement it again.  The caller can protect itself from the effect by incrementing the reference count before calling the function, and decrementing it later.  sebres is conflating the idea that at the script-level a variable value nominally remains untouched with the false idea that a Tcl_Obj reference count also remains untouched.
</blockquote>

pooryorick added on 2023-03-31 15:19:19: (text/x-fossil-wiki)
"No, as already illustrated it is not."

<blockquote>
Tcl_ObjSetVar2 behaves exactly as if it had initially incremented the reference
count of newValuePtr and then decremented it in the failure case.  If there is
any complete runnable working demo that shows this not to be true, I would like
to see it, but I know there won't be one because I can read the current code.
</blockquote>

sebres added on 2023-03-31 13:05:10: (text/x-markdown)
> _"we can be not sure what happens with reference in Tcl_ObjSetVar2"_

>>    _Yes, we can be sure: The reference count of the Tcl_Obj is incremented, and in the failure case, it is decremented._

No, indirectly it is not that what you assume here - it does decrement only if it is the "last" reference (to free the object), see it again the criticized fix [510663a99e3a096](https://core.tcl-lang.org/tcl/info/510663a99e3a096bb7bab7314eb59fc805335318):
```
    } else if (dictPtr->refCount == 0) {
	Tcl_DecrRefCount(dictPtr);
    }
```
and others inclusive this one in tclptrsetvar
```
  earlyError:
    if (newValuePtr->refCount == 0) {
	Tcl_DecrRefCount(newValuePtr);
    }
    goto cleanup;
```
or in the [current code](https://core.tcl-lang.org/tcl/file?ci=tip&name=generic/tclVar.c&ln=1864-1866).

> _"Much worse it looks just like undefined behavior."_

>>    _False. Tcl_ObjSetVar2 performs the equivalent of incrementing the variable, and then decrementing it in the error case._

No, as already illustrated it is not. 

Again, I don't complain the increment and decrement, but a **destructive decrement**, what is not expected in error case. The is more or less a matter of the API responsibility, and a nominal condition of every function in error case (here value is not assigned to variable, dict or whatever) is don't touch the value, still worse to remove it on certain conditions (by default). I'd say nothing if the decrement would be non-destructive (e. g. <code>newValuePtr->refCount--</code>), so wouldn't free the object, but it does. Und exactly that I called as it were like UB.

> _"For the developer using Tcl_ObjSetVar, it is very bad up to impossible to act plausible in error case, because he don't know when he should decrease refCount in error case or not..."_

>>    _False. The caller either increments the Tcl_Obj prior to passing it to Tcl_ObjSetVar2 and arranges for it to be decremented later, or just doesn't increment it and allows Tcl_ObjSetVar2 to manage it._

But exactly that is hardly possible on certain conditions as I showed in example before. Because the decrement takes place conditionally and object gets destroyed or even not, depending on its reference counter.

> _"Thus either he do this nevertheless and can get segfault, or he produces intentionally a memory leak (quasi accepts the risk of possible leak)."_

>>    _False. If the caller wants to use the Tcl_Obj after passsing it to Tcl_ObjSetVar2, it simply increments the reference count first, and then decrements it later, or passes it back to its own caller, explaining that the reference count has been incremented._

But if one is forced always to increment the value before <code>Tcl_ObjSetVar2</code> why the trick (with decrement on refCount == 0) is necessary at all?
Again, I don't complain about that. My expectations are that in error case the object doesn't become destroyed.

...

>> _The example failed to make use of the result of Tcl_ObjSetVar2. As the documentation warns, it is not correct to simply return objPtr becuause the Tcl_Obj * returned by Tcl_ObjSetVar2 may be an entirely different object. The correct code would look like this:_

>> <i><pre><code>Tcl_Obj * CacheSetObj(Tcl_Obj *varName, Tcl_Obj *objPtr)
{
  return Tcl_ObjSetVar2(interp, cache, varName, objPtr, 0);</code>
}</code></pre></i>

>> ...

Your code example does completely different thing - now it is able to return NULL, whether my code **example** (I emphasize example) always returns an object. The assignment to variable is here just a caching mechanism which can help to avoid retrieve object again and again, but it'd work if cache is unavailable or whatever is happening with caching variable.

Again, the usage of the object after assignment is not infrequently the case and this was just a simplest example I could imagine.
Sure one can use increment before one tries to store the object to variable, but as already several times said, normally it should not be a mandatory condition at all. 
Let alone sometimes one would need to have unshared object (refCount < 2) for instance to retain it alterable, otherwise one'd start the dancing with tambourine (if shared duplicate etc pp).

>> _Notice that there is no reason for CacheSetObj to manipulate the reference count of objPtr at all, and that Tcl_ObjSetVar2 correctly cleans up objPtr if needed._

You seem don't understand the code and the intention at all.
In that example it should not clean the objPtr, never ever! Moreover set to var (assign to dict, etc) is **totaly secondary action** here, that only allows to cache (and still retain unshared) object which gets processed further.

However I have no intention to convince you or the TCT.<br/>
Bye.

pooryorick added on 2023-03-30 19:01:35: (text/x-fossil-wiki)
While diagnosing and fixing [6d4e9d1af5|memory leak: SetFsPathFromAny, assisted
by the global literal table, causes a Tcl_Obj to reference itself], I spent
days looking into this and similar reports.  I tried various experiments to
come up with an example that would confirm this report, ultimately spending a
fair chunk of time only to determine that it is baseless.  Now it's time to put
this ticket to rest so that the next developer doesn't have to spend their time
on it too.  [510663a99e3a096b] is a smart commit and a good commit:  It fixes
the real bug, as described by msofer for [1334947]:  The real bug is that not
all failure paths clear 0-ref objs.  It also made Tc_ObjSetVar2 do more so that
the caller can do less.  The "TCL_OWN_OBJREF" invention proposed by sebres
doesn't solve any known problem so there is no reason to adopt it.  The
ad-hominem assertion, "someone doesn't understand something (or doesn't want
spent more time to go deeper)" is simply another indicator that there is no
substantial argument to be made here on behalf of this report.

Now for a basic explanation of reference counting in Tcl: It is well documented
and well understood that in order to preserve a Tcl_Obj a caller increments its
reference count before passing it to another procedure, and later either
decrements the reference count or documents that it has incremented the
reference count so that its own caller can decrement it.  It's common for
procedures to increment and then decrement a Tcl_Obj reference count, and
Tcl_ObjSetVar2 does nothing more than the equivalent of that, and its not
alone.  If a procedure knows that nothing else is going to decrement the
reference count in the meantime, then the performant way to do the equivalent
of incrementing and then decremening the reference count is to only increment
it on success, and then on failure to handle the only case where decrementing
it would do anything:  The case in which the reference count is zero.  That's
all that Tcl_ObjSetVar2 does.  It's elegant and correct.

Now to rebut the statements made by sebres so far:

"we can be not sure what happens with reference in Tcl_ObjSetVar2"

<blockquote>
Yes, we can be sure:  The reference count of the Tcl_Obj is incremented, and in
the failure case, it is decremented.
</blockquote>

"The function checks the refCount only AFTER trying of set the variable (which
can change the refCount of the object by calling of the traces..."

<blockquote>
False.  As dgp explained, in the failure case Tcl_ObjSetVar2 decrements the reference count and returns before any trace runs.
</blockquote>

"Much worse it looks just like undefined behavior."

<blockquote>
False.  Tcl_ObjSetVar2 performs the equivalent of incrementing the variable,
and then decrementing it in the error case.
</blockquote>


"For the developer using Tcl_ObjSetVar, it is very bad up to impossible to act plausible in error case, because he don't know when he should decrease refCount in error case or not..."

<blockquote>
False.  The caller either increments the Tcl_Obj prior to passing it to Tcl_ObjSetVar2 and arranges for it to be decremented later, or just doesn't increment it and allows Tcl_ObjSetVar2 to manage it.
</blockquote>

"Thus either he do this nevertheless and can get segfault, or he produces intentionally a memory leak (quasi accepts the risk of possible leak)."

<blockquote>
False.  If the caller wants to use the Tcl_Obj after passsing it to
Tcl_ObjSetVar2, it simply increments the reference count first, and then
decrements it later, or passes it back to its own caller, explaining that the
reference count has been incremented.  Commit [510663a99e3a096b], which sebres
is complaining about, is actually the commit that fixed this type of issue, but
sebres doesn't seem to be aware of that.  </blockquote>

"this behavior is undocumented,"

<blockquote>
This was not false at the time it was written, but in general developers should know to increment the reference count of a Tcl_Obj before passing it to a function they don't control.
</blockquote>

"You don't know what the object "objPtr" is (it is not yours, you'll get it from foreign code and from other people),
so it can have refCount == 0, 1 and larger as 1.
How you can set it to the variable varName, inside this function and what you'll do in error case.
If you'll increment it, possibly you can (unwanted) destroy it in error case by the paired decrement.
If you don't increment it, you can not be sure what do you want to do in the "

<blockquote>
If someone else passes you a Tcl_Obj with a refCount of 0, then they are
expecting you to clean it up if needed, so incrementing it and decrementing it
is not a problem.  If you want it to be unshared in order to efficiently modify
the internal representation, then modify the Tcl_Obj and then, depending on
your needs, either increment the reference count or don't before passing it to
Tcl_ObjSetVar2, which handles the unshared object correctly in the error case.
It's starting to sound like the bug you're complaining about is in the design
of your own system.
</blockquote>

"Meanwhile I've implemented both things for my own tcl-fork, and it works fine and saves me 
many time, work and code-lines (not to mention the readability)."

<blockquote>
You are welcome to commit high-quality work.  If it's good, people will be glad
to have it in the implementation o Tcl.  No need for a bug report.  The code
will suffice.  If it's not a good contribution it will be duly reverted.
</blockquote>

"or additionally the "obscure" hack with auto-decrement in the error case will remain"

<blockquote>
It's not an obscure hack.  It's simply the equivalent of incrementing the
reference count and then decrementing it in the error case.
</blockquote>


The most recent code presented by sebres as an example of the issue he is
reporting is actually just an example of code that is itself buggy: 

<code><verbatim>
Tcl_Obj * CacheSetObj(Tcl_Obj *varName, Tcl_Obj *objPtr)
{
  Tcl_ObjSetVar2(interp, cache, varName, objPtr, 0);
  return objPtr;
}

...

if ((objPtr = CacheGetObj(someName)) != NULL)
  return objPtr;
return CacheSetObj(someName, SomethingToObtainObj(...));
</verbatim></code>

The example failed to make use of the result of Tcl_ObjSetVar2.  As the
documentation warns, it is not correct to simply return objPtr becuause the
Tcl_Obj * returned by Tcl_ObjSetVar2 may be an entirely different object.  The
correct code would look like this:

<code><verbatim>
Tcl_Obj * CacheSetObj(Tcl_Obj *varName, Tcl_Obj *objPtr)
{
  return Tcl_ObjSetVar2(interp, cache, varName, objPtr, 0);
}

...

if ((objPtr = CacheGetObj(someName)) != NULL)
  return objPtr;
return CacheSetObj(someName, SomethingToObtainObj(...));
</verbatim></code>

Notice that there is no reason for CacheSetObj to manipulate the reference
count of objPtr at all, and that Tcl_ObjSetVar2 correctly cleans up objPtr if
needed.

sebres has had over five years to come up with some working code that
demonstrates some memory leak, some segmentation fault, or some other
unexpected behaviour that justifies this report, and has so far failed.   This
report will now be closed as "invalid".  My opinion is that sebres is entirely
wrong about this issue.  The most likely explanation is that he is misusing the
Tcl API and then blaming problems on Tcl.  If he wishes to reopen this repor,
sebres should supply a complete working example that reproduces an issue.

sebres added on 2023-03-29 12:51:52: (text/x-fossil-wiki)
Just for the record, here is simplest example to illustrate the issue:
<code><pre>
Tcl_Obj * CacheSetObj(Tcl_Obj *varName, Tcl_Obj *objPtr)
{
  Tcl_ObjSetVar2(interp, cache, varName, objPtr, 0);
  return objPtr;
}

...

if ((objPtr = CacheGetObj(someName)) != NULL)
  return objPtr;
return CacheSetObj(someName, SomethingToObtainObj(...));
</pre></code>

This is completely valid scenario and it'd even work almost always properly... unless <code>SomethingToObtainObj</code> returns unreferenced object and cache doesn't exists for some reason, because then after failed assignment the object gets deleted.

Similar scenarios (like try to set to var1, if failed to var2/dict/whatever) are also pretty imaginable.
As well as vise versa situation (unwillingness to use the object after set variable): necessity to free the object in error case (but conditionally) makes it hardly possible too, because one doesn't really know the state of object (whether it already removed or not).

And again an increment and decrement around is not really a solution here, because:
<ul>
<li>it is totally unneeded action</li>
<li>it is not obvious for the caller (basically the matter of responsibility of API functions and solves an inconsistency introduced in Tcl API)<li>
<li>it cannot be seriously considered as mandatory merely on the grounds that the "fix" [510663a99e3a096bb7bab7314eb59fc805335318] becomes totally unnecessary (if one assumes only to use <code>Tcl_ObjSetVar2</code> with objects with refCount > 0)</li>.
</ul>

Rather Tcl needs some new (may be internal) function working with such a decrement, or a flag (like <code>TCL_OWN_OBJREF</code>) allowing the decrement in error case.

And inconsistent "fix" [510663a99e3a096bb7bab7314eb59fc805335318] of <code>TclPtrSetVar</code> with <code>earlyError</code> block must be definitely reverted, merely for the reason that <b>a potential for an unlikely leak is many times better than an unlikely heisenbug (double free, usage after free, reference error etc)</b>, especially if a leak, that may arise for habitude reasons, occurs only in error case.

sebres added on 2023-03-28 12:40:59: (text/x-fossil-wiki)
Please don't close it.<br/>
If someone doesn't understand something (or doesn't want spent more time to go deeper), this is not a reason to close the ticket.<br/>
Regarding the examples - the ticket contains enough of them to illustrate the wrong behavior of <code>Tcl_ObjSetVar2</code> and co in the error case.

Basically since <code>Tcl_ObjSetVar2</code> doesn't create an object and doesn't realy own an object in case of error (because it doesn't in common sense), <b>it should not decrease a reference in error case</b> at all, even less on certain conditions only (depending on the reference count of object, how it does right now) - this is dirty workaround and not a solution. Especially undocumented, see [https://www.tcl-lang.org/man/tcl/TclLib/SetVar.htm|SetVar.htm]. 
However the only way to fix it properly - use a new flag <code>TCL_OWN_OBJREF</code>, like I did it.

Additionally I'd like to note that after the fix [bc5e7bdafc922baf] (and later with a new flag <code>TCL_OWN_OBJREF</code> I implemented) I never saw this bug anymore (neither the reference error, nor the leak).

pooryorick added on 2023-03-27 22:14:44: (text/x-fossil-wiki)
No code that reproduces a problem was ever provided, and as explained by dgp, the assertions repeated by sebres don't hold up.  Now closing.

dgp added on 2021-04-26 15:08:48:
dkf, has your work getting ref count expectations documented
made a difference on this ticket?

sebres added on 2019-06-21 12:15:47: (text/x-fossil-wiki)
The next one (was recently fixed by Donal): [ca4a6f0a9595ac1f].<br/>
And I guess it is still not the last.

sebres added on 2018-01-16 12:19:12:
Append to my previous comment...

As reragds the new flag TCL_OWN_OBJREF - it should say the Tcl_ObjSetVar2, that the object 
should be owned to the variable.
So Tcl_ObjSetVar2 DOES NOT increase refCount in success case and should ALWAYS decrease the
refCount in error case. Emphasis on ALWAYS (and not just in case of no references after failed assignment).

So both codes from my previous comment will look like:

1. Transmit "ownership" of reference to variable (** don't touched in success, -1 in error case):
```
int X1 (Tcl_Obj *varName, Tcl_Obj *objPtr, ...) {
  if (Tcl_ObjSetVar2(interp, varName, NULL, objPtr, flags | TCL_OWN_OBJREF)) == NULL) {
    /* Handle error ... */
    return TCL_ERROR;
  }
}
```

2. New reference with variable (** +1 if success, don't touched in error case):
```
int X2 (Tcl_Obj *varName, Tcl_Obj *objPtr, ...) {
  if (Tcl_ObjSetVar2(interp, varName, NULL, objPtr, flags & ~TCL_OWN_OBJREF)) == NULL) {
    /* Handle error ... */
    return TCL_ERROR;
  }
}
```

That's it.

In this case all internal calls that submit new created object could be:
- either extended with `flags | TCL_OWN_OBJREF`
- or additionally the "obscure" hack with auto-decrement in the error case will remain, 
  but should be then definitely rewritten as suggested in branch fix-8-5-578155d5a19b348d]
  (see [bc5e7bdafc]).

The second way (auto-decrement in error case on refCount==0) is admittedly backwards-compatible,
but IMHO not good in the long term, also if it will be documented as precise as possible.

sebres added on 2018-01-16 09:46:10:
> I don't know of any counterexample, and you have not given me one.

But, I had.
So I'll try to explain my example with function X again.

Now the developer has several issues depending from initial situation:

1. If X1 "owns" the object reference (caller of X1 transferred object to X1), we should always
write the code like below:
```
int X1 (Tcl_Obj *varName, Tcl_Obj *objPtr, ...) {
  Tcl_Obj *varObj;
  int orgRefCount = objPtr->refCount;
  if ((varObj = Tcl_ObjSetVar2(interp, varName, NULL, objPtr, flags)) == NULL) {
    if (orgRefCount != 0) {
      /* BUG-X1.1 - can be wrong here, because the object can be destroyed in trace */
      Tcl_DecrRefCount(objPtr);
    }
    /* Handle error ... */
    return TCL_ERROR;
  }
  if (varObj == objPtr) {
    if (orgRefCount != 0) {
      Tcl_DecrRefCount(objPtr);
    }
  } else {
    /* BUG-X1.2 - can be dangerous here, because the object can be destroyed in trace */
    if (orgRefCount != 0 && objPtr->refCount > 1) {
      Tcl_DecrRefCount(objPtr);
    }
  }
}
```

2. If X2 don't "own" the object (caller of X2 still "holds" the ownership (also indirectly, 
e. g. as member of class), it means - just assign a variable), we should always write the 
code like below:

```
int X2 (Tcl_Obj *varName, Tcl_Obj *objPtr, ...) {

  if (sharingAllowed) {

    Tcl_IncrRefCount(objPtr);
    if (Tcl_ObjSetVar2(interp, varName, NULL, objPtr, flags) == NULL) {
      /* BUG-X2.1 - it can produce a leak, if objPtr->refCount was decremented to 1 during 
         some trace-execution, but if we'll do it always, we can unwanted destroy object */
      if (objPtr->refCount > 1) {
        Tcl_DecrRefCount(objPtr);
      }     
      /* Handle error ... */
      return TCL_ERROR;
    }
    /* BUG-X2.2 - it can produce a leak, if objPtr->refCount was decremented to 1 during 
       some trace-execution, but if we'll do it always, we can unwanted destroy object */
    if (objPtr->refCount > 1) {
       Tcl_DecrRefCount(objPtr);
    }
    /* Handle success case ... */
    return TCL_OK;

  } else {

    if (!objPtr->refCount) {
      /* BUG-X2.3 - Impossible, we should increment it before call of Tcl_ObjSetVar2, 
         but in this case the object can get shared state after assignment to var */
    }
    if (Tcl_ObjSetVar2(interp, varName, NULL, objPtr, flags) == NULL) {
      /* Handle error ... */
      /* BUG-X2.4 - the object can be destroyed inside Tcl_ObjSetVar2, so not allowed to use anymore 
         in caller of X2 */
      return TCL_ERROR;
    }

  }
}
```

dgp added on 2018-01-13 16:45:43:
> If there's some scenario that framework will not handle properly
> I'm genuinely interested in knowing what it is.  I need a demo.

Allow me to make that more precise.

If callers of Tcl_ObjSetVar2 follow the advice I spelled out before,
there will be no memory leak, and there will be no segfault due to
a double-free of the value.  I don't know of any counterexample,
and you have not given me one.  That's what I mean when I claim
that Tcl_ObjSetVar2() is bug-free when called properly. Note that
in most cases "called properly" means do nothing about the
refcount of value at all, which is very convenient.

Now it's becoming more clear to me that your objection is a bit
different. You not only seek to avoid memleaks and double-frees,
but also seek to avoid turning unshared values into shared values,
and you're looking for an improvement that can achieve that too.

I don't fully understand your TCL_OWN_OBJREF idea, and the
branch  fix-8-5-578155d5a19b348d seems to be out of date.
Is there a patch anywhere of the change you seek?

I think at a minimum, any change to Tcl_ObjSetVar2() has to
keep the following existing calling convention operating
successfully without leaks or crashes:

if (NULL = Tcl_ObjSetVar2(interp, arr, elem, Tcl_NewObj(), flags)) {
    /* Handle error */
}

Does your proposed patch preserve this 'fire and forget' mode of
operation?  Or does it impose the burden of needing to add the
TCL_OWN_OBJREF flag to flags in order to maintain that functionality?

sebres added on 2018-01-11 22:11:35:
> 1. That's false.  Look at the code.

And?... 

1.a. You are also a bit wrong here, because there are more places, where it can be 
    (also during set, see https://core.tcl.tk/tcl/artifact/1f16d4aba6?ln=1934-1936).
1.b. You'll say that TclObjLookupVarEx can't cause decrement of reference of some object? It can! 
    (Think about the traces on get, in-place changes of internal representation of varname, etc.)
1.c. The caller of Tcl_SetVar2Ex don't know at all why this function was failed, during lookup or during 
    some trace call through set... Or something else. So what should he do???

> 2.a. Here's the framework for a caller to cover all the cases.

This scenario is well known to me and may pretty good work for simply (especially for new-created objects) 
or for the objects for which you can sure increment reference before you assign it.

But it does not good for foreign objects, which you don't "owns" resp. for which you must not 
call Tcl_IncrRefCount, for example because you must not make it shared in this case (should be 
definitely "unshared" by assignment) or because you can not be sure that assignment fails, so 
your decrement of the refCount hereafter will guaranteed destroy this foreign object.

You forget that the variable is not necessary a simple var, it can be "upvared" to 
namespace (class/object) member, linked via some "complex" resolver to some scope- or 
scope-less frame, etc.

> 2.b. Please show me the failure.

Ok, here you go: just imagine, that you are developer of function X (e. g. with public API) that looks like:

int X (Tcl_Obj *varName, Tcl_Obj *objPtr, ...);

You don't know what the object "objPtr" is (it is not yours, you'll get it from foreign code and from other people),
so it can have refCount == 0, 1 and larger as 1.
How you can set it to the variable varName, inside this function and what you'll do in error case.
If you'll increment it, possibly you can (unwanted) destroy it in error case by the paired decrement.
If you don't increment it, you can not be sure what do you want to do in the error case.
Also extra sharing of the assignment-object may be sometimes undesirable in many-many cases. At least can cause many
unwanted interim duplication's (just think about, that by assignment, the trace on set can be better executed with
unshared object, etc.). 
But in worse case may even cause errors like "...something... called with shared object" or similar).

Of course, I can also imagine some scenario's (like yours framework) for all this cases, BUT then this essentially
trivial task (just "set a var") will grow in no more trivial use-case... (and more code - more errors, at least by
people that are not very familiar with Tcl-API, so you'll really force them always to write all this around it?).

The question is WHY it should be?! So why it all should be necessary just to assign an object to the variable?

This (in my opinion controversial) behavior can be easy "fixed":

- either using special flag (like suggested below) - TCL_OWN_OBJREF, which will always decrease the refCount in error
  case (so the caller can control it resp. does the decision what should happen with him);
- or using pre-checking of refCount inside set-functions (before variable-lookup resp. assignment takes place, and
  before the object gets no more sane state in undefined error case).
- or even both to be totally backwards-compatible in all cases and to provide for developers the opportunity to decide,
  what should happen.

Meanwhile I've implemented both things for my own tcl-fork, and it works fine and saves me 
many time, work and code-lines (not to mention the readability).

dgp added on 2018-01-11 20:24:10:
>   1) The function checks the refCount only AFTER trying of set the variable
> (which can change the refCount of the object by calling of the traces, etc).

That's false.  Look at the code.

https://core.tcl.tk/tcl/artifact/1f16d4aba6?ln=1718-1728

  1718      varPtr = TclObjLookupVarEx(interp, part1Ptr, part2Ptr, flags, "set",
  1719  	    /*createPart1*/ 1, /*createPart2*/ 1, &arrayPtr);
  1720      if (varPtr == NULL) {
  1721  	if (newValuePtr->refCount == 0) {
  1722  	    Tcl_DecrRefCount(newValuePtr);
  1723  	}
  1724  	return NULL;
  1725      }
  1726  
  1727      return TclPtrSetVarIdx(interp, varPtr, arrayPtr, part1Ptr, part2Ptr,
  1728  	    newValuePtr, flags, -1);

The Decrement comes only after a failed lookup attempt, which calls no traces.
Traces are called from within TclPtrSetVarIdx(). The lookup routine also
doesn't take newValuePtr as an argument. The only "callback" opportunities
I see in the lookup routines is the invocation of variable resolvers.

> 2) For the developer using Tcl_ObjSetVar, it is very bad up to impossible
> to act plausible in error case, because he don't know when he should decrease
> refCount in error case or not, 

Here's the framework for a caller to cover all the cases.
If the caller wants to prevent any free of "value" by Tcl_ObjSetVar2(),
it can call Tcl_IncrRefCount(value) before calling Tcl_ObjSetVar2()
and call Tcl_DecrRefCount(value) after it no longer will use "value".
This will manage the lifetime without leaks.  If the caller has no need
to use "value" other than passing it to Tcl_ObjSetVar2(), then the code
below does it.

Tcl_Obj *value; /* Either create or receive as argument, any refcount. */
Tcl_Obj *stored;
Tcl_Obj *varName = Tcl_NewStringObj("foo", -1);
Tcl_IncrRefCount(varName);
stored = Tcl_ObjSetVar2(interp, varName, NULL, value, TCL_LEAVE_ERR_MSG);
Tcl_DecrRefCount(varName);
if (stored == NULL) {
    /* If value had refcount 0, it has been freed. */
    /* Otherwise, it still has the refcount it was called with. */
} else if (stored == value) {
    /* The variable "foo" holds value */
    /* The refcount of value one greater than it was called with. */
} else {
    /* The variable "foo" holds stored, which is different from value. */
    /* Traces intervened. */
    /* If value had refcount 0, it is freed. */
    /* Otherwise, it still has the refcount it was called with. */
}

This approach is very convenient for callers. In most cases they
need not deal with refcounts of "value" at all.

If there's some scenario that framework will not handle properly
I'm genuinely interested in knowing what it is.  I need a demo.
Please show me the failure.

sebres added on 2018-01-11 16:46:03:
> If you find code like this anywhere:
> if (Tcl_ObjSetVar2(interp, varObj, NULL, newObj, 0) == NULL) {Tcl_DecrRefCount(newObj);}
> ...
> But that's the only bug I see.

Well, this would be so, if the Tcl_ObjSetVar2 will get "ownership" for the object always, also in the error case.

But, the problem is very, very serious, Don.

Currently, Tcl_ObjSetVar "owns" object only in success-case, AND in error case, BUT then only if refCount is 0! AFTER the "failed" attempt of assignment, Don!

So I see at least 2 issues here:
  1) The function checks the refCount only AFTER trying of set the variable (which can change the refCount of the object by calling of the traces, etc). Much worse it looks just like undefined behavior.
  2) For the developer using Tcl_ObjSetVar, it is very bad up to impossible to act plausible in error case, because he don't know when he should decrease refCount in error case or not, especially because the object can be released within Tcl_ObjSetVar, and then the developer has normally no chance to check the refCount (resp. the vitality of the object hereafter is unknown in error case). 

Thus either he do this nevertheless and can get segfault, or he produces intentionally a memory leak (quasi accepts the risk of possible leak).

Additionally it is very confusing behavior for the function from public API in my understanding... Also if this will be "correctly" documented.

In my opinion it is wrong so it is implemented.

dgp added on 2018-01-11 15:56:32:
If you haven't already seen it, the history of the
purpose of the interface change in Tcl_ObjSetVar2()
between 8.4 and 8.5 is here

https://core.tcl.tk/tcl/tktview/1334947

It was not until much later that the corresponding change
to Tcl_SetVar2Ex() was completed properly.

https://core.tcl.tk/tcl/tktview/7368d225a6

In 8.4 it was impossible to call the variable setting
routines in a way that could fully reliably prevent
memory leaks. Error paths from failed lookup were
inconsistent with error paths from traces.
It was an unusable interface.

For 8.5, the refcount behavior of the variable setting
routines was changed to what it is now.  These routines
now act as Consumers of the "newValuePtr" argument.
Effectively they always incr the refCount, and promise
to eventually decr it.  If you pass in a 0 ref count newly
created value, this scheme will not leak it. That's the
most convenient way to use these routines.  It is possible
the newValuePtr will be freed before the variable setting
routine returns.  If the caller has a need to use that value
after the variable setting routine returns, it needs to
explicitly make its own incr of the refcount to preserve
the value and its own balancing decr.

If you find code like this anywhere:

Tcl_Obj *newObj = Tcl_NewObj();
if (Tcl_ObjSetVar2(interp, varObj, NULL, newObj, 0) == NULL) {
  Tcl_DecrRefCount(newObj);
  return TCL_ERROR;
}

It is an outdated attempt to code to the 8.4 interface.
It must be changed to work properly with Tcl 8.5 and later.
See for example,

http://core.tcl.tk/tk/tktview?name=3607326

I agree the documentation of this is not good enough. But
that's the only bug I see.

If I'm missing some scenario where that's not addressing something
important, I'm afraid I'm going to need a concrete demonstration
(not a description. Something I can run) to see it.  Best I can tell
only a misbehaving variable resolver can cause trouble, and in that
case the remedy would be to make that resolver behave.

sebres added on 2017-12-01 21:16:24:
/ping/ Any thoughts about this?...

sebres added on 2017-07-17 16:59:39: (text/x-fossil-wiki)
Well, this seems to have repercussions - today I would check my idea with new flag <code>TCL_OWN_OBJREF</code> and have verified at which places everywhere in tcl (and some modules like thread, etc) it may be needed.
Thereby I found many places, where it's currently wrong (e. g. usage of released object, wrong free or even leaks).<br/>
Too many to list all this here...<br/>
Just as an example, see <a href="https://core.tcl.tk/tcl/artifact/396c145dddedc7d0?ln=2750">Tcl_ObjSetVar2(..., matchVarObj, NULL, emptyObj, ...)</a> that will use already released object <code>emptyObj</code> if 10 lines <a href="https://core.tcl.tk/tcl/artifact/396c145dddedc7d0?ln=2740">above</a> the same object <code>emptyObj</code> will be released in trace by <code>Tcl_ObjSetVar2(..., indexVarObj, NULL, emptyObj, ...) </code>.
Note that in current versions this does not have <code>Tcl_DecrRefCount(emptyObj)</code> in error cases (since auto-release in [510663a99e3a096bb7bab7314eb59fc805335318]), but it does no matter because this can be released in trace by set.

I would like to fix all such errors (and similar) for 8.5th, 8.6th and trunk branches (together with introducing of already suggested new flag TCL_OWN_OBJREF or using some other solution like new internal function <code>TclObjOwnAndSetVar</code>), but firstly I would like to know what TCT thinks about (new flag?, new function?, something other?).
IMHO but (very-very controversial) auto-release made in [510663a99e3a096bb7bab7314eb59fc805335318] is not really a solution and should be rewritten.

Please note also, that this behavior is undocumented, so many people make still:
<code><pre>
Tcl_Obj *newObj = SomethingReturnsNewObjOfTypeX(...);
if (Tcl_ObjSetVar2(..., varObj, NULL, newObj, ...) == NULL) {
  Tcl_DecrRefCount(newObj);
  return TCL_ERROR;
}
</pre></code>
What is currently wrong (because since [510663a99e3a096bb7bab7314eb59fc805335318] it is double decreased, and can cause segfault).

sebres added on 2017-07-14 14:26:09: (text/x-fossil-wiki)
the first attempt (8.5 based) as minimalist fix made in <a href="https://core.tcl.tk/tcl/timeline?r=fix-8-5-578155d5a19b348d&nd&n=200">fix-8-5-578155d5a19b348d</a>.

sebres added on 2017-07-14 10:20:42: (text/x-fossil-wiki)
Still one argument against this code at all - if caller want to release ownership of the object (should not hold own reference anymore after set variable, the caller should currently take care by the decreasing of it.<br/>
Especially in the error case of <code>Tcl_ObjSetVar2</code> (because this time it should then always do it). How it can be sure, that <code>Tcl_ObjSetVar2</code> did it not already?
So for example:
<code><pre>
/* we get someValue from somewhere and take ownership of reference now */
someValue = TransmitObjFromSomewhere(...);
...
if (!Tcl_ObjSetVar2(interp, varName, NULL, someValue, ...)) {
  /* [**BAD**] error case - decrease and return */
  Tcl_DecrRefCount(someValue);
  /* [**/BAD**] */
  return ...;
}
/* do somewhat else also, e. g. using this object, for example: */
... Tcl_GetString(someValue) ...

/* we are ready here, release our ownership (further in var) */
Tcl_DecrRefCount(someValue);
return ...;

</pre></code>

But currently we cannot do this sane...<br/>
I mean the code enclosed between <code>[**BAD**]...[**/BAD**]</code>.<br/>
Because we can be not sure what happens with reference in <code>Tcl_ObjSetVar2</code>.

Possibly we should introduce a new flag (TCL_OWN_OBJREF) like here:
<code><pre>
<b style="color:gray">// tcl.h:</b>

<b style="color:green">+ /* Indicate the object reference supplied to Tcl_ObjSetVar2 etc should be not owned by caller anymore */</b>
<b style="color:green">+ #define TCL_OWN_OBJREF           0x800000</b>

<b style="color:gray">// everywhere where expected to transmit the reference ownership to var:</b>

<b style="color:red">- Tcl_ObjSetVar2(interp, varName, NULL, someValue, TCL_LEAVE_ERR_MSG);</b>
<b style="color:red">- Tcl_DecrRefCount(someValue);</b>
<b style="color:green">+ Tcl_ObjSetVar2(interp, varName, NULL, someValue, TCL_OWN_OBJREF | TCL_LEAVE_ERR_MSG);</b>
</pre></code>

It could then set the variable value with decreasing of the refCount by variable assignment routines (but always, not only in error case if refCount was 0).

IMHO only this would be the right way, to make it sane.

sebres added on 2017-07-12 20:04:15: (text/x-fossil-wiki)
Although the emphasis was "only once" referenced, but it is not always the case: the object newValue can be referenced twice or more (newValue->refCount > 1) and described situation can nevertheless still occur, if we've something like this:
<code><pre>
set X [list [list 1st-list $newValue] [list 2nd-list $newValue] ...]
call-something-that-set-var ::missing::namespace::var [call-something-unset-X-and-return-its-value X]
</pre></code>
I think this example describes the issue very good.

Thus my proposal to solve it (resp. to workaround it) using:
<pre><code>
<b style="color:green">+     int freeNewVal = (newValuePtr->refCount == 0);</b>

     part1 = TclGetString(part1Ptr);
     part2 = ((part2Ptr == NULL) ? NULL : TclGetString(part2Ptr));
  
     varPtr = TclObjLookupVar(interp, part1Ptr, part2, flags, "set",
            /*createPart1*/ 1, /*createPart2*/ 1, &arrayPtr);
     if (varPtr == NULL) {
<b style="color:red">-        if (newValuePtr->refCount == 0) {</b>
<b style="color:green">+        if (freeNewVal) {</b>
             Tcl_DecrRefCount(newValuePtr);
         }
         return NULL;
     }
</code></pre>