Tcl Source Code

View Ticket
Login
2023-09-12
12:40 Closed ticket [b3628609ad]: concatenating strings is broken when the "bytes" field of a Tcl_Obj contains invalid data and the type is not tclStringType plus 6 other changes artifact: 695c5061ed user: jan.nijtmans
2023-05-02
19:58
Fix issue [b3628609ad73a105], by allowing TclStringCat to assume that each Tcl_Obj.bytes value is co... check-in: 29108f928e user: pooryorick tags: unchained
15:26 Pending ticket [b3628609ad]: concatenating strings is broken when the "bytes" field of a Tcl_Obj contains invalid data and the type is not tclStringType plus 3 other changes artifact: 668a3b4a34 user: pooryorick
15:25 Ticket [b3628609ad]: 3 changes artifact: 5385a97373 user: pooryorick
12:47
Fix issue [b3628609ad73a105], by allowing TclStringCat to assume that each Tcl_Obj.bytes value is co... Leaf check-in: 2ddc0c3fa9 user: pooryorick tags: pyk-b3628609ad
2023-04-23
22:37 Ticket [b3628609ad] concatenating strings is broken when the "bytes" field of a Tcl_Obj contains invalid data and the type is not tclStringType status still Open with 3 other changes artifact: d001c93fed user: pooryorick
18:50 Ticket [b3628609ad]: 3 changes artifact: 27438671e6 user: pooryorick
2023-04-22
16:17 Ticket [b3628609ad]: 3 changes artifact: 32bf1dc73f user: jan.nijtmans
05:35 Open ticket [b3628609ad]. artifact: eed28b98f0 user: apnadkarni
2023-04-21
22:14 Pending ticket [b3628609ad]. artifact: b22b9b706c user: pooryorick
22:10
Fix issue [b3628609ad73a105] by scrapping most of TclStringCat() in favor of Tcl_AppendObjToObj(). ... check-in: b699959d62 user: pooryorick tags: pyk-b3628609ad
21:25 New ticket [b3628609ad] concatening strings is broken when the "bytes" field of a Tcl_Obj contains invalid data and the type is not tclStringType. artifact: 3ab3ed9a7e user: pooryorick

Ticket UUID: b3628609ad73a105f0e51435e0e0ac5b8205fad7
Title: concatenating strings is broken when the "bytes" field of a Tcl_Obj contains invalid data and the type is not tclStringType
Type: Bug Version:
Submitter: pooryorick Created on: 2023-04-21 21:25:11
Subsystem: - New Builtin Commands Assigned To: pooryorick
Priority: 5 Medium Severity: Important
Status: Closed Last Modified: 2023-09-12 12:40:23
Resolution: Invalid Closed By: jan.nijtmans
    Closed on: 2023-09-12 12:40:23
Description:

When the following script is evaluated by trunk build of tcltest, p1 produces the expected value, 2, whereas p2 produces an incorrect value, 1.

proc p1 {} {
	string length [testbytestring \xc0][testbytestring \x80]
}
proc p2 {} {
	string length [list [testbytestring \xc0]][testbytestring \x80]
}

puts [p1]
puts [p2]

For p1, TclStringCat() uses the tclStringType internal representation where \xc0 is treated as a single character and so is \x80, and therefore character concatenation is performed producing the correct result.

For p2, tclStringCat does not use tclStringType because it detects that there is another internal representation already in place. Instead, it simply concatenates the "bytes" fields of the values, and therefore byte concatenation is performed, so the two invalid byte sequences become a single valid byte sequence encoding one character.

If the body of p2 is evaluated directly instead of in the body of a procedure, the result is 2 instead of 1. This happens because there is no bytecode to execute, and instead TclEvalEx calls TclSubstTokens(), which does not call TclStringCat, but instead calls Tcl_AppendObjToObj, which replaces the list internal representation with a tclSringType internal representation, and eventually calls AppendUtfToUnicodeRep().

User Comments: jan.nijtmans added on 2023-09-12 12:40:23:

> Fixed in [b699959d62]. I looked into trying to make the ....

> Fixed in [2ddc0c3fa9] by allowing TclStringCat() to assume that all Tcl_Obj.bytes values can be safely catenated as-is.

Since TclStringCat() concatenates Unicode code-points - not bytes - this is a nun-bug. Your latest 'fix' was merged to the 'unchained' branch. Please keep it there and all is well.


pooryorick added on 2023-05-02 15:25:23:

I agree that Tcl should be able to retain efficient functions such as TclStringCat(). I think that in this case it is the test that is wrong, not TclStringCat(). Tcl never stores improperly-encoded data in Tcl_Obj.bytes, which is documented to contain data encoded in utf-8, with each nul character being represented in its overlong utf-8 form, \xC0\x80. Thus, a function like TclStringCat() should be able to assume that in the catenation of Tcl_Obj.bytes values there is no change to the interepretation of any byte in any of the catenated values.

TclStringCat was competently written with that in mind to achieve the following goals:

  • Avoid shimmering & string rep generation.
  • Produce pure bytearray when possible.
  • Error on overflow.

In 2021 the commits [fd66e6e20e] and [a7df881994] modified TclStringCat() to force the generation of a tclStringType internal representation in order to avoid the effect of combining two otherwise unrelated incomplete byte sequences into one complete sequence. This change degraded TclStringCat() with respect to its stated goals. The current issue report could probably be resolved by further degrading TclStringCat() in a similar way, but that would move Tcl further in the wrong direction of supporting incomplete values in Tcl_Obj.bytes.

As I stated a couple of years ago in [7f1162a867], any third party that stores improperly-encoded data in Tcl_Obj.bytes is creating a garbage-in/garbage-out situation, and should not expect the results of any Tcl functions thereafter to make sense. Third parties must use Tcl_ExternalTo* functions to produce properly-encoded values. If a third party directly modifies Tcl_Obj.bytes such that the last byte sequence for an encoded code point isn't complete, then it should not expect Tcl to interpret that sequence as complete.

Either third parties pay a relatively small penalty of properly-encoding values at the boundary of Tcl, or Tcl pays a relatively large penalty throughout its internal routines, and at a cost of greater code complexity.

Fixed in [2ddc0c3fa9] by allowing TclStringCat() to assume that all Tcl_Obj.bytes values can be safely catenated as-is.


jan.nijtmans added on 2023-04-22 16:17:22:

I share Ashok's concern. In addition, the build with TCL_UTF+MAX=3 was broken by this 'fix'. Therefore I moved it away from trunk.


apnadkarni added on 2023-04-22 05:35:07:

I have concerns with this fix. I suspect TclStringCat was written the way it was for performance reasons. Here is a simple benchmark.

After your fix:

% set b [testbigdata bytearray 100000]; string length $b
100000
% timerate -calibrate {}
0.03403821298619367 µs/#-overhead 0.038996 µs/# 50056827 # 25643866 #/sec
% timerate {string cat $b $b $b $b $b $b $b $b $b $b}
866.686 µs/# 1154 # 1153.8 #/sec 1000.156 net-ms
% set b [testbigdata string 100000]; string length $b
100000
% timerate {string cat $b $b $b $b $b $b $b $b $b $b}
730.523 µs/# 1369 # 1368.9 #/sec 1000.086 net-ms

Here's the same with the commit prior to your fix.

% set b [testbigdata bytearray 100000]; string length $b
100000
% timerate -calibrate {}
0.0332670332449772 µs/#-overhead 0.036267 µs/# 53823298 # 27573410 #/sec
% timerate {string cat $b $b $b $b $b $b $b $b $b $b}
28.1215 µs/# 35519 # 35560.0 #/sec 998.846 net-ms
% set b [testbigdata string 100000]; string length $b
100000
% timerate {string cat $b $b $b $b $b $b $b $b $b $b}
31.6985 µs/# 31515 # 31547.2 #/sec 998.979 net-ms

That difference is not to be dismissed. Moreover, it does not even take into account that Unicode strings will be shimmered, possibly impacting performance of further calls.

Correctness obviously trumps performance. However, here it seems it would be worthwhile to fix TclStringCat rather than throw it all away. It is also not entirely clear to me what it means that Tcl is not consistent on invalid input data and whether that needs correcting.


pooryorick added on 2023-04-21 22:14:47:

Fixed in [b699959d62]. I looked into trying to make the existing code for TclStringCat work, but it eventually became clear that it would have to mostly become a clone of Tcl_AppendObjToObj() in order to work correctly. There are a couple added benefits to this fix: First, it reduces the number of code paths that perform concatenation. Second, all callers of Tcl_AppendObjToObj will benefit from the reduction in generation of string representations.