Tcl Source Code

View Ticket
Login
Ticket UUID: b3628609ad73a105f0e51435e0e0ac5b8205fad7
Title: concatenating strings is broken when the "bytes" field of a Tcl_Obj contains invalid data and the type is not tclStringType
Type: Bug Version:
Submitter: pooryorick Created on: 2023-04-21 21:25:11
Subsystem: - New Builtin Commands Assigned To: pooryorick
Priority: 5 Medium Severity: Important
Status: Closed Last Modified: 2023-09-12 12:40:23
Resolution: Invalid Closed By: jan.nijtmans
    Closed on: 2023-09-12 12:40:23
Description: (text/x-fossil-wiki)
When the following script is evaluated by trunk build of tcltest,
<code>p1</code> produces the expected value, <code>2</code>, whereas
<code>p2</code> produces an incorrect value, <code>1</code>.


<blockquote><code><verbatim>
proc p1 {} {
	string length [testbytestring \xc0][testbytestring \x80]
}
proc p2 {} {
	string length [list [testbytestring \xc0]][testbytestring \x80]
}

puts [p1]
puts [p2]
</verbatim></code></blockquote>


For <code>p1</code>, <code>TclStringCat()</code> uses the
<code>tclStringType</code> internal representation where <code>\xc0</code> is
treated as a single character and so is <code>\x80</code>, and therefore
character concatenation is performed producing the correct result.

For <code>p2</code>, <code>tclStringCat</code> does not use
<code>tclStringType</code> because it detects that there is another internal
representation already in place.  Instead, it simply concatenates the "bytes"
fields of the values, and therefore byte concatenation is performed, so the two
invalid byte sequences become a single valid byte sequence encoding one
character. 

If the body of <code>p2</code> is evaluated directly instead of in the body of
a procedure, the result is <code>2</code> instead of <code>1</code>.  This
happens because there is no bytecode to execute, and instead
<code>TclEvalEx</code> calls <code>TclSubstTokens()</code>, which does not call
<code>TclStringCat</code>, but instead calls <code>Tcl_AppendObjToObj</code>,
which replaces the list internal representation with a
<code>tclSringType</code> internal representation, and eventually calls
<code>AppendUtfToUnicodeRep()</code>.
User Comments: jan.nijtmans added on 2023-09-12 12:40:23: (text/x-fossil-wiki)
> Fixed in [b699959d62]. I looked into trying to make the ....

> Fixed in [2ddc0c3fa9] by allowing TclStringCat() to assume that all Tcl_Obj.bytes values can be safely catenated as-is.

Since TclStringCat() concatenates Unicode code-points - not bytes - this is a nun-bug. Your latest 'fix' was merged to the 'unchained' branch. Please keep it there and all is well.

pooryorick added on 2023-05-02 15:25:23: (text/x-fossil-wiki)
I agree that Tcl should be able to retain efficient functions such as
<code>TclStringCat()</code>.  I think that in this case it is the test that is
wrong, not <code>TclStringCat()</code>.  Tcl never stores improperly-encoded
data in <code>Tcl_Obj.bytes</code>, which is documented to contain data
encoded in utf-8, with each nul character being represented in its overlong
utf-8 form, <code>\xC0\x80</code>.  Thus, a function like
<code>TclStringCat()</code> should be able to assume that in the catenation of
<code>Tcl_Obj.bytes</code> values there is no change to the interepretation of
any byte in any of the catenated values.

<code>TclStringCat</code> was competently written with that in mind to achieve
the following goals:

    *  Avoid shimmering & string rep generation.
    *  Produce pure bytearray when possible.
    *  Error on overflow.

In 2021 the commits [fd66e6e20e] and [a7df881994] modified
<code>TclStringCat()</code> to force the generation of a
<code>tclStringType</code> internal representation in order to avoid the effect
of combining two otherwise unrelated incomplete byte sequences into one
complete sequence.  This change degraded <code>TclStringCat()</code> with
respect to its stated goals.  The current issue report could probably be
resolved by further degrading <code>TclStringCat()</code> in a similar way, but
that would move Tcl further in the wrong direction of supporting incomplete
values in <code>Tcl_Obj.bytes</code>.

As I stated a couple of years ago in [7f1162a867], any third party that stores
improperly-encoded data in <code>Tcl_Obj.bytes</code> is creating a
garbage-in/garbage-out situation, and should not expect the results of any Tcl
functions thereafter to make sense.  Third parties must use
<code>Tcl_ExternalTo*</code> functions to produce properly-encoded values.  If
a third party directly modifies <code>Tcl_Obj.bytes</code> such that the last
byte sequence for an encoded code point isn't complete, then it should not
expect Tcl to interpret that sequence as complete.

Either third parties pay a relatively small penalty of properly-encoding values
at the boundary of Tcl, or Tcl pays a relatively large penalty throughout its
internal routines, and at a cost of greater code complexity.

Fixed in [2ddc0c3fa9] by allowing <code>TclStringCat()</code> to assume that
all <code>Tcl_Obj.bytes</code> values can be safely catenated as-is.

jan.nijtmans added on 2023-04-22 16:17:22: (text/x-fossil-wiki)
I share Ashok's concern. In addition, the build with TCL_UTF+MAX=3 was broken by this 'fix'. Therefore I moved it away from trunk.

apnadkarni added on 2023-04-22 05:35:07: (text/x-markdown)
I have concerns with this fix. I suspect `TclStringCat` was written the way it was for performance reasons. Here is a simple benchmark.

After your fix:

```
% set b [testbigdata bytearray 100000]; string length $b
100000
% timerate -calibrate {}
0.03403821298619367 µs/#-overhead 0.038996 µs/# 50056827 # 25643866 #/sec
% timerate {string cat $b $b $b $b $b $b $b $b $b $b}
866.686 µs/# 1154 # 1153.8 #/sec 1000.156 net-ms
% set b [testbigdata string 100000]; string length $b
100000
% timerate {string cat $b $b $b $b $b $b $b $b $b $b}
730.523 µs/# 1369 # 1368.9 #/sec 1000.086 net-ms
```

Here's the same with the commit prior to your fix.

```
% set b [testbigdata bytearray 100000]; string length $b
100000
% timerate -calibrate {}
0.0332670332449772 µs/#-overhead 0.036267 µs/# 53823298 # 27573410 #/sec
% timerate {string cat $b $b $b $b $b $b $b $b $b $b}
28.1215 µs/# 35519 # 35560.0 #/sec 998.846 net-ms
% set b [testbigdata string 100000]; string length $b
100000
% timerate {string cat $b $b $b $b $b $b $b $b $b $b}
31.6985 µs/# 31515 # 31547.2 #/sec 998.979 net-ms
```

That difference is not to be dismissed. Moreover, it does not even take into account that Unicode strings will be shimmered, possibly impacting performance of further calls.

Correctness obviously trumps performance. However, here it seems it would be worthwhile to fix TclStringCat rather than throw it all away. It is also not entirely clear to me what it means that Tcl is not consistent on invalid input data and whether that needs correcting.

pooryorick added on 2023-04-21 22:14:47: (text/x-fossil-wiki)
Fixed in [b699959d62].  I looked into trying to make the existing code for TclStringCat work, but it eventually became clear that it would have to mostly become a clone of Tcl_AppendObjToObj() in order to work correctly.  There are a couple added benefits to this fix:  First, it reduces the number of code paths that perform concatenation.  Second, all callers of Tcl_AppendObjToObj will benefit from the reduction in generation of string representations.