Tcl Source Code

View Ticket
2024-08-20
07:18 Ticket [debd088e48] llength split is different from string length status still Closed with 6 other changes artifact: d354427e8a user: pooryorick
2021-03-10
13:28 Closed ticket [0d61d3a2bb]: Tcl interprets two adjacent surrogate code points as a character encoded using UTF-16 plus 8 other changes artifact: 7ad5084b00 user: jan.nijtmans
2021-02-25
11:59 Ticket [debd088e48] llength split is different from string length status still Closed with 5 other changes artifact: a302cbc67f user: jan.nijtmans
2021-02-24
18:14 Ticket [debd088e48]: 4 changes artifact: 150581e89d user: dgp
18:09 Ticket [debd088e48]: 5 changes artifact: 6376a57c49 user: dgp
2021-02-22
08:57 Closed ticket [debd088e48]. artifact: 79ae6cc7c6 user: jan.nijtmans
2021-02-15
20:43 Ticket [debd088e48]: 3 changes artifact: 4d6522ab28 user: jan.nijtmans
18:56 Ticket [debd088e48]: 3 changes artifact: 32d14ff5c8 user: kyak
17:53 Ticket [debd088e48]: 3 changes artifact: 34e77fdbc4 user: jan.nijtmans
2021-02-13
13:52 Ticket [debd088e48]: 3 changes artifact: 88e23725e7 user: sebres
05:53 Ticket [debd088e48]: 3 changes artifact: 3eb277f2b3 user: kyak
2021-02-12
17:42 Ticket [22324bcbdf] string reverse is broken in 8.6.11 status still Open with 5 other changes artifact: fa313cc1cc user: jan.nijtmans
15:06 Ticket [debd088e48] llength split is different from string length status still Open with 3 other changes artifact: a3029dcf90 user: jan.nijtmans
14:11 Ticket [debd088e48]: 3 changes artifact: ea54818e50 user: sebres
14:05 Open ticket [debd088e48]. artifact: 859648c476 user: sebres
07:25 Pending ticket [debd088e48]. artifact: 8f73e96130 user: jan.nijtmans
2021-02-11
18:49 Ticket [debd088e48]: 3 changes artifact: 36d519a7b1 user: kyak
18:48 New ticket [debd088e48]. artifact: b43ba4c4d7 user: kyak

Ticket UUID: debd088e48998a758f2927b67c132341954428ea
Title: llength split is different from string length
Type: Bug Version: 8.6.11
Submitter: kyak Created on: 2021-02-11 18:48:35
Subsystem: 44. UTF-8 Strings Assigned To: jan.nijtmans
Priority: 5 Medium Severity: Minor
Status: Closed Last Modified: 2024-08-20 07:18:53
Resolution: Fixed Closed By: pooryorick
    Closed on: 2024-08-20 07:18:53
Description:
Starting from Tcl 8.6.11, for unicode strings, [llength [split]] is different from [string length]. Here is an example:

$ echo 'puts "[info patchlevel] -- [llength [split "\ud83e\udd1d" {}]] -- [string length "\ud83e\udd1d"]"' | tclsh
8.6.10 -- 2 -- 2

$ echo 'puts "[info patchlevel] -- [llength [split "\ud83e\udd1d" {}]] -- [string length "\ud83e\udd1d"]"' | tclsh
8.6.11 -- 1 -- 2
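
The invariant in question can be checked directly. The following is a minimal sketch, assuming a hypothetical helper named countBoth (not part of Tcl), that counts a string both ways and reports a mismatch. The result for the surrogate pair depends on the interpreter version, as the outputs above show.

```tcl
# Hypothetical helper (not part of Tcl): count "characters" both ways.
proc countBoth {str} {
    set bySplit  [llength [split $str {}]]
    set byLength [string length $str]
    return [list $bySplit $byLength]
}

# The surrogate pair from the report (U+1F91D written as two \u escapes).
set sample "\ud83e\udd1d"
lassign [countBoth $sample] bySplit byLength
puts "[info patchlevel]: split=$bySplit length=$byLength"
if {$bySplit != $byLength} {
    puts "split and string length disagree on this interpreter"
}
```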
User Comments: pooryorick added on 2024-08-20 07:18:53:

This was fixed for Tcl 9.0. See [0d61d3a2bb905178] for details.


jan.nijtmans added on 2021-02-25 11:59:39:

> do we yet have any kind of guide to the long-term Tcl programmer how their code should evolve.

No we don't yet. Agreed that such a document should be written. I'm willing to start on that.


dgp added on 2021-02-24 18:14:57:
Speaking of migration pain, do we yet have any kind of guide to the long-term Tcl programmer how their code should evolve (scripts and extensions) to accommodate the shifting definition of the string value approved in TIP 389 and subsequent TIPs?  Maybe I'll learn to love the disruption if I can see the deployment plan.

If we don't have such a document yet, it is long overdue.

dgp added on 2021-02-24 18:09:07:
"Applications using Emoji.... should know better..."

This presumes an app that has control over the data it receives and processes.  What's happened in Tcl 8.6.11 is that there are now values that can get into Tcl that break its operations according to long-standing expectations.  There are invariants that Tcl coders have been able to assume for a very long time, that are suddenly variant now, demanding new coding strategies. This imposes  a new defensive duty on every app to screen its inputs looking for these toxic values to keep them out, or to recode any procedures that relied on the invariants to tolerate the new.  This is a significant burden at any point in Tcl development, imposing some migration pain in the approved 8.6 -> 8.7 -> 9.0 migration, but it is really out of place that it leaked back into the 8.6.11 patch release.

Jan, I understand you disagree, and that you won't fix it, but I think the search for a better way should continue.

jan.nijtmans added on 2021-02-22 08:57:35:

Since kyak's "tcl-telegram" application is already fixed, and splitting of emoji is (still) a bad idea, the behavior now is as it always should have been. It's unfortunate this change occurred in a patch release (sorry!), but - given this situation - it's best to keep it this way and go forward.

So, "Won't fix". Applications using Emoji in Tcl 8.6 (which never officially supported that) should know better or handle it correctly. See my comment of 2021-02-15 17:53:03 below for how this "correct handling" should be done.


jan.nijtmans added on 2021-02-15 20:43:40:

Yes, it's unfortunate. Using string index will continue to work in Tcl 8.7, but it will break in Tcl 9.0. So if you want your code to work with Tcl 9.0, you will _have_ to change it one day. It's up to you.


kyak added on 2021-02-15 18:56:08:
@jan.nijtmans thanks for your suggestion!

I need those unicode surrogate pairs encoded as a pair of \uXXXX\uXXXX.

So this won't work:

append result [format "\\U%06.6x" $code]

I guess this might work:

#append result [format "\\u%04.4x\\u%04.4x" [expr {(($code-0x10000)>>10)+0xD800}] [expr {(($code-0x10000)&0x3FF)+0xDC00}]]

But I have trouble reading this code :) Better to stick to string index, which is much more readable.

Also, the if {$code > 0xFFFF} branch is basically only there for Tcl 8.6.11 and higher, because older Tcl versions won't even hit it. Almost like a #if GCC_VERSION, but for Tcl :)
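
The surrogate arithmetic in the commented-out line becomes easier to read when the steps are named. This is only a sketch: the procs surrogatePair and surrogateEscape are hypothetical helpers introduced here for illustration, and they assume the input code point is above 0xFFFF.

```tcl
# Hypothetical helper: split a code point above 0xFFFF into its UTF-16
# surrogate pair and return both halves as integers.
proc surrogatePair {code} {
    set offset [expr {$code - 0x10000}]            ;# 20-bit value
    set high   [expr {0xD800 + ($offset >> 10)}]   ;# top 10 bits
    set low    [expr {0xDC00 + ($offset & 0x3FF)}] ;# bottom 10 bits
    return [list $high $low]
}

# Hypothetical helper: format the pair as two \uXXXX escapes.
proc surrogateEscape {code} {
    lassign [surrogatePair $code] high low
    return [format "\\u%04x\\u%04x" $high $low]
}

puts [surrogateEscape 0x1F91D]   ;# prints \ud83e\udd1d
```

Because this is plain integer arithmetic, it does not depend on how a given Tcl version represents strings internally.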

jan.nijtmans added on 2021-02-15 17:53:03:

@kyak: How about still using "split" but being able to handle codes > 0xFFFF:

foreach char [split $str {}] {
	scan $char %c code
	if {$code > 0xFFFF} {
		append result [format "\\U%06.6x" $code]
		#append result [format "\\u%04.4x\\u%04.4x" [expr {(($code-0x10000)>>10)+0xD800}] [expr {(($code-0x10000)&0x3FF)+0xDC00}]]
	} elseif {$code > 127} {
		append result [format "\\u%04.4x" $code]
	} else {
		append result $char
	}
}

This way, you can handle codes > 0xFFFF separately, any way you like, either using the \U?????? syntax, or split it in surrogates.

This will continue to work in Tcl 8.7 and Tcl 9.0 as well!


sebres added on 2021-02-13 13:52:40:

The thing is for this concrete case one could surely use something else (string map, regexp, some binary encoder, whatever), but as already said the issue is about compatibility and consistency. Also binary packages using utf-routines are affected by the regression.

Don confirmed in the Tcl chat that iterating and indexing should agree. I think what matters is that they agree no matter which method one uses, regardless of whether it is a surrogate pair or not, which char representation is considered right, and whether it is handled as one or two characters.

If I were the TCT, to support emoji etc. I would have made an absolutely new class of functions, for example Tcl_Glyph* (and a Tcl ensemble, e.g. [glyph]), and retained the Tcl_Utf*/Tcl_Uni* (as well as [string] etc.) stuff unmodified... or, if in Tcl_Uni*, then somehow optional (configurable per interp/build), so it would be compatible and expected, as well as usable where one needs it (e.g. representation on display, navigation, or input in some edit box).

And that all no matter 8.6, 8.7 or 9.0 (not to mention that current implementation in 8.6 resembles rather a dirty workaround).

As for compatibility - [22324bcbdf] (and I assume tickets surely follow) as well as suggested by Christian fix show what the issue is and which effort is needed to fix all that. Consequently Tcl would distance from utf-8 and move into unicode. Let alone a possible future shimmering of several object types to string if handling would prefer string object methods (if conversion like TclUtfToUniChar is not enough for the handling).


kyak added on 2021-02-13 05:53:12:
Hi Jan,

You can have a look at my application and exact commit where I have to make a fix after upgrading to 8.6.11 here:

https://gitlab.com/kyak/tcl-telegram/-/commit/bf5614cb429493a6d66a3a553f1821a116142c9b

Basically, I am using "foreach split" to iterate over string elements and convert symbols with code > 127 to \uXXXX sequences.

jan.nijtmans added on 2021-02-12 15:06:20:

Thanks, @sebres, for sharing your view.

Let's wait on @kyak for providing more information, how this affects him.


sebres added on 2021-02-12 14:05:22:

So you mean that a situation where Tcl routines following the unicode string representation behave totally differently from routines following the utf-8 string representation is normal?
When exactly did Tcl start to switch behaviour depending on string innards and on which string function is used?

For example, using [string index]/[string length] it is possible to consider every char of a surrogate pair, but [split] etc. (so basically everything using Tcl_UtfNext, Tcl_UtfPrev, Tcl_UtfToUniChar etc.) misses that (deviates in behavior when navigating or iterating over the string).

Leaving aside that such an issue in a minor version upgrade is totally unexpected and can well be considered a regression, handling like this (different behavior depending on which function/command is used) is nonsense, Jan. No language in the world behaves like this or has a weakness like this.

Moreover, if you would take a look at python (basically not so good example, because python changes 2.x -> 3.x were a disaster as regards backwards compatibility), but just to note - they made a completely different switch. So the "progress" in Tcl looks so:

% puts "[info patchlevel]--[binary encode hex [encoding convertto utf-8 "\ud83e\udd1d"]]--[string length "\ud83e\udd1d"]"
- 8.6.10--eda0beedb49d--2
+ 8.6.11--f09fa49d--2

% puts [info patchlevel]--[binary encode hex [encoding convertto utf-8 [split "\ud83e\udd1d" {}]]]--[binary encode hex [encoding convertto utf-8 [string index "\ud83e\udd1d" 0]]]
- 8.6.10--eda0be20edb49d--eda0be
+ 8.6.11--f09fa49d--eda0be

And totally reverse picture in a python comparison:
>>> sys.version_info[0:2], u"\ud83e\udd1d".encode('utf-8', 'surrogatepass')
- ((2, 7), '\xf0\x9f\xa4\x9d')
+ ((3, 8), b'\xed\xa0\xbe\xed\xb4\x9d')
So from point of view of python how tcl handles that now would be rather a "degradation" :) at least they went in totally another direction.
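
The two byte sequences in these comparisons can be reproduced by hand. The sketch below uses a hypothetical proc utf8Hex that applies the generic UTF-8 bit layout to any integer, including surrogate values, which strict UTF-8 does not permit. It is only meant to show where eda0beedb49d and f09fa49d come from, not how Tcl or Python implement the conversion.

```tcl
# Hypothetical helper: apply the UTF-8 bit layout to an integer code value
# and return the resulting bytes as a hex string. Surrogates are NOT rejected.
proc utf8Hex {code} {
    if {$code < 0x80} {
        return [format %02x $code]
    } elseif {$code < 0x800} {
        return [format %02x%02x \
            [expr {0xC0 | ($code >> 6)}] \
            [expr {0x80 | ($code & 0x3F)}]]
    } elseif {$code < 0x10000} {
        return [format %02x%02x%02x \
            [expr {0xE0 | ($code >> 12)}] \
            [expr {0x80 | (($code >> 6) & 0x3F)}] \
            [expr {0x80 | ($code & 0x3F)}]]
    } else {
        return [format %02x%02x%02x%02x \
            [expr {0xF0 | ($code >> 18)}] \
            [expr {0x80 | (($code >> 12) & 0x3F)}] \
            [expr {0x80 | (($code >> 6) & 0x3F)}] \
            [expr {0x80 | ($code & 0x3F)}]]
    }
}

puts [utf8Hex 0xD83E][utf8Hex 0xDD1D]   ;# each surrogate encoded on its own
puts [utf8Hex 0x1F91D]                  ;# the pair combined into U+1F91D
```

The first line prints eda0beedb49d (two 3-byte sequences) and the second prints f09fa49d (one 4-byte sequence), matching the two results quoted above.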

Anyway, regardless of how one implements the conversion between the unicode and utf-8 representations of surrogate pairs, the different behavior across different functions and Tcl commands is definitely a bug, simply because it is not consistent anymore.
This means that such changes require more work and at the very least imply a complete rework of all the functions in the Tcl core, let alone the commands (like split etc. using Tcl_UtfNext), which would now have to switch completely to the unicode representation.

I know where it is coming from and why, or rather what exactly it was meant to cover when it was implemented this way (e.g. do you still remember the "planting bombs" regression found in [bd94500678]?), but that is not an excuse for an inconsistency like this. In the end this was just a feature and not a security leak or the like, and a feature is expected to retain backwards compatibility and, of course, consistency.
In [c61818e4c9] I suggested a branch as a solution for the problem which works more consistently and fixes the Tcl_UtfPrev flaws more sanely (e.g. in [cdffcbaec97b94d2]), taking backwards compatibility into account and retaining index/length consistency between the unicode and utf-8 representations. It's a mystery to me why the final conclusion was to switch the behavior to an incompatible and inconsistent way.

As a bottom line regarding the whole handling (no matter which representation of both is retained at end), I don't think this different behavior on functions can be considered as a feature. Sorry, but it is definitely a bug, Jan.


jan.nijtmans added on 2021-02-12 07:25:25:

This change was made on purpose: when a high surrogate is followed by a low surrogate, those two form a single character (like an emoji). So, they are not supposed to be split, ever. But in Tcl 8.6, the internal representation of strings is in UTF-16, so such characters still have a length of 2 (that will be fixed in Tcl 9.0).

Just curious what problem this causes in your application. Since 8.7 has even more code to make sure that emoji cannot be split, your application will break anyway when making the switch to 8.7. Most likely you should have used the "unicode" encoder to convert the string to UTF-16 characters, and then do more processing from there, instead of using "split", but without looking at your application it's hard to say.
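
The encoder-based approach mentioned here might look roughly like the sketch below. The proc utf16Units is a hypothetical helper, not something from the application under discussion. It assumes the "unicode" encoding is available on Tcl 8.6 and prefers "utf-16le" where the interpreter offers it. How lone surrogates are treated by the encoder is version dependent and is not handled here.

```tcl
# Hypothetical helper: return the 16-bit code units of a string as integers.
proc utf16Units {str} {
    set units {}
    if {"utf-16le" in [encoding names]} {
        # Newer interpreters: explicit little-endian UTF-16,
        # scanned as unsigned little-endian 16-bit integers.
        binary scan [encoding convertto utf-16le $str] su* units
    } else {
        # Tcl 8.6: "unicode" yields 16-bit units in native byte order,
        # scanned as unsigned native-endian 16-bit integers.
        binary scan [encoding convertto unicode $str] tu* units
    }
    return $units
}

# Escape every unit above 127 as \uXXXX, leaving ASCII untouched.
set result ""
foreach unit [utf16Units "caf\u00e9"] {
    if {$unit > 127} {
        append result [format "\\u%04x" $unit]
    } else {
        append result [format %c $unit]
    }
}
puts $result
```

Working on code units keeps the loop independent of whether split or string index treats a surrogate pair as one element or two.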

Thanks!