Ticket UUID: | debd088e48998a758f2927b67c132341954428ea | |||
Title: | llength split is different from string length | |||
Type: | Bug | Version: | 8.6.11 | |
Submitter: | kyak | Created on: | 2021-02-11 18:48:35 | |
Subsystem: | 44. UTF-8 Strings | Assigned To: | jan.nijtmans | |
Priority: | 7 High | Severity: | Important | |
Status: | Open | Last Modified: | 2025-03-03 21:49:11 | |
Resolution: | Remind | Closed By: | nobody | |
Closed on: | 2024-08-20 07:18:53 | |||
Description: |
Starting from Tcl 8.6.11, for Unicode strings, [llength [split]] is different from [string length]. Here is an example:
$ echo 'puts "[info patchlevel] -- [llength [split "\ud83e\udd1d" {}]] -- [string length "\ud83e\udd1d"]"' | tclsh
8.6.10 -- 2 -- 2
$ echo 'puts "[info patchlevel] -- [llength [split "\ud83e\udd1d" {}]] -- [string length "\ud83e\udd1d"]"' | tclsh
8.6.11 -- 1 -- 2
User Comments: |
sebres added on 2025-03-03 21:49:11:
Well, in 8.6 sure... because it doesn't fully support chars outside of the BMP, yes, it would be better than the potential for inconsistent indices/llength/iterators (depending on the type of object, or on whether it shimmers or not), which opens up a potential for BO, segfaults, vulnerabilities etc. Let alone that the example with regexp shows how common string consistency may suffer from shimmering of an object. This is much "worse" than a theoretical split (I'm not speaking about regexp, but about the principle).
I'm open to another proposal to fix it, but as already supposed, you don't have one, do you? But for instance, if one breaks everything anyway, then why not just set TCL_UTF_MAX to 4 and fix it like in 9.0? I know it would not be binary compatible anymore, but it is nevertheless several times better and probably more sane than such a fundamentally broken and serious violation.
Please don't use arguments like that here... I speak explicitly only about 8.6(.11+), which by the way, IIRC, was changed in an incompatible way without any TIP, in a patch release. And I already wrote that it is not acceptable to say "fixed" in the newest major release, because then you should surely revert it in 8.6 as a failed attempt.
Sorry Jan, but that is also not an argument... Please don't compare a feature (no matter how sought-after by users) with a bug that should be fixed ASAP. And I'm sure you know that I have not been a "user" of core Tcl for a dozen years already (I have my own fork)... But I still care about it, because I'm a developer of Tcl as well as an engineer and security auditor, and therefore I see that things are basically broken now... And I hoped that my explanation would show that having emojis doesn't justify the presence of such a malevolent bug... Again, with the potential to grow further, since nobody really knows how many commands, and in which constellations, are really affected. And estimating them would be a lot of work (but it is obvious to everyone who knows the C side of Tcl's source code). It seems my hope has been in vain. OK, then I'd like to wait for the opinion of other TCT members on the subject.
jan.nijtmans added on 2025-03-03 20:58:56:
$ tclsh8.6 % split π π % string reverse π π
So you are proposing to split the π, or (worse) reverse the two halves again? That's a NO from me. But you already expected that.
I'm aware that for regexp we cannot fix π to be seen as a single character. In 9.0, regexp behaves as expected. I cannot 100% satisfy every Tcl 8.6 user any more.
sebres added on 2025-03-02 04:24:32:
Please don't close this ticket... One can decide to split or not to split surrogates, so
Again, no matter what the decision will be and how it would be fixed (whether it becomes 2 or 1), THEY MUST BE EQUAL. I mean, no matter whether they remain splittable surrogates or not, it must be the same way everywhere, not like now: sometimes yes, sometimes no. In other words, iteration over utf-8 and unicode needs to be consistent, especially in a dynamically typed language, all the more in an EIAS language, especially if objects may shimmer, etc. Just to avoid a mess like this:
and that at script level by simple shimmer from pure utf-8 to string with unicode... At C-level it may be much worse, and even pretty vulnerable, with potential for memory-access violations, buffer overruns, segfaults etc. (just because one cannot rely on index/length consistency anymore). To say that "chars out of BMP" are not "fully" supported in 8.6 is not quite correct: then why touch it and rewrite it in this broken way at all?
pooryorick added on 2024-08-20 07:18:53:
This was fixed for Tcl 9.0. See [0d61d3a2bb905178] for details.
jan.nijtmans added on 2021-02-25 11:59:39:
> do we yet have any kind of guide to the long-term Tcl programmer how their code should evolve.
No we don't yet. Agreed that such a document should be written. I'm willing to start on that.
dgp added on 2021-02-24 18:14:57:
Speaking of migration pain, do we yet have any kind of guide to the long-term Tcl programmer how their code should evolve (scripts and extensions) to accommodate the shifting definition of the string value approved in TIP 389 and subsequent TIPs? Maybe I'll learn to love the disruption if I can see the deployment plan. If we don't have such a document yet, it is long overdue.
dgp added on 2021-02-24 18:09:07:
"Applications using Emoji.... should know better..." This presumes an app that has control over the data it receives and processes. What's happened in Tcl 8.6.11 is that there are now values that can get into Tcl that break its operations according to long-standing expectations. There are invariants that Tcl coders have been able to assume for a very long time, that are suddenly variant now, demanding new coding strategies. This imposes a new defensive duty on every app to screen its inputs looking for these toxic values to keep them out, or to recode any procedures that relied on the invariants to tolerate the new.
This is a significant burden at any point in Tcl development, imposing some migration pain in the approved 8.6 -> 8.7 -> 9.0 migration, but it is really out of place that it leaked back into the 8.6.11 patch release. Jan, I understand you disagree, and that you won't fix it, but I think the search for a better way should continue.
jan.nijtmans added on 2021-02-22 08:57:35:
Since kyak's "tcl-telegram" application is already fixed, and splitting of emoji is (still) a bad idea, the behavior now is as it always should have been. It's unfortunate this change occurred in a patch release (sorry!), but, given this situation, it's best to keep it this way and go forward. So, "Won't fix". Applications using Emoji in Tcl 8.6 (which never officially supported that) should know better or handle it correctly .... See my comment of 2021-02-15 17:53:03 below for how this "correct handling" should be done.
jan.nijtmans added on 2021-02-15 20:43:40:
Yes, it's unfortunate. Using string index will continue to work in Tcl 8.7, but it will break in Tcl 9.0. So if you want your code to work with Tcl 9.0, you will _have_ to change it one day. It's up to you.
kyak added on 2021-02-15 18:56:08:
@jan.nijtmans thanks for your suggestion! I need those unicode surrogate pairs encoded as a pair of \uXXXX\uXXXX. So this won't work:
append result [format "\\U%06.6x" $code]
I guess this might work:
#append result [format "\\u%04.4x\\u%04.4x" [expr {(($code-0x10000)>>10)+0xD800}] [expr {(($code-0x10000)&0x3FF)+0xDC00}]]
But I have trouble reading this code :) Better to stick to string index, which is much more readable. Also, the if {$code > 0xFFFF} branch is basically only there for Tcl 8.6.11 and higher, because older Tcl versions won't even hit it.
Almost like a #if GCC_VERSION, but for Tcl :)
jan.nijtmans added on 2021-02-15 17:53:03:
@kyak: How about still using "split" but being able to handle codes > 0xFFFF:
foreach char [split $str {}] {
    scan $char %c code
    if {$code > 0xFFFF} {
        append result [format "\\U%06.6x" $code]
        #append result [format "\\u%04.4x\\u%04.4x" [expr {(($code-0x10000)>>10)+0xD800}] [expr {(($code-0x10000)&0x3FF)+0xDC00}]]
    } elseif {$code > 127} {
        append result [format "\\u%04.4x" $code]
    } else {
        append result $char
    }
}
This way, you can handle codes > 0xFFFF separately, any way you like, either using the \U?????? syntax, or splitting them into surrogates. This will continue to work in Tcl 8.7 and Tcl 9.0 as well!
sebres added on 2021-02-13 13:52:40:
The thing is, for this concrete case one could surely use something else (string map, regexp, some binary encoder, whatever), but as already said, the issue is about compatibility and consistency. Binary packages using the utf routines are also affected by the regression. Don confirmed in the tclchat that iterating and indexing should agree. And I think the important point is that this holds no matter which method one uses, regardless of whether it is a surrogate pair or not, which char representation is considered right, and whether it is handled as a single character or as two. If I were the TCT, to support emojis etc. I would have made an entirely new class of functions, for example Tcl_Glyph* (and a Tcl ensemble, e.g. [glyph]), and kept the Tcl_Utf*/Tcl_Uni* (as well as [string] etc.) stuff unmodified... or, if it had to be in Tcl_Uni*, then somehow optional (configurable per interp/build), so that it would be compatible and behave as expected, as well as usable where one needs it (e.g. representation on a display, or navigation or input in some edit box). And all that regardless of 8.6, 8.7 or 9.0 (not to mention that the current implementation in 8.6 rather resembles a dirty workaround).
As for compatibility: [22324bcbdf] (and I assume more tickets will surely follow), as well as the fix suggested by Christian, show what the issue is and how much effort is needed to fix all that. Consequently Tcl would distance itself from utf-8 and move towards unicode. Let alone a possible future shimmering of several object types to string, if the handling were to prefer string object methods (if a conversion like TclUtfToUniChar is not enough for the handling).
kyak added on 2021-02-13 05:53:12:
Hi Jan, you can have a look at my application and the exact commit where I had to make a fix after upgrading to 8.6.11 here: https://gitlab.com/kyak/tcl-telegram/-/commit/bf5614cb429493a6d66a3a553f1821a116142c9b Basically, I am using "foreach split" to iterate over string elements and convert symbols with code > 127 to \uXXXX sequences.
jan.nijtmans added on 2021-02-12 15:06:20:
Thanks, @sebres, for sharing your view. Let's wait for @kyak to provide more information on how this affects him.
sebres added on 2021-02-12 14:05:22:
So you mean that the situation where Tcl routines following the unicode string representation behave totally differently from routines following the utf-8 string representation is normal? For example, using [string index]/[string length] it is possible to address every char of a surrogate pair, but [split] etc. (so basically everything using Tcl_UtfNext, Tcl_UtfPrev, Tcl_UtfToUniChar etc.) misses that (it deviates in behavior when navigating or iterating over the string). Let alone that such an issue in a minor version upgrade is totally unexpected and can well be considered a regression, handling like this (different behavior depending on which function/command is used) is nonsense, Jan. No language in the world behaves like this or has a weakness like this. Moreover, if you take a look at Python (basically not such a good example, because the Python 2.x -> 3.x changes were a disaster as regards backwards compatibility), just to note: they made a completely different switch.
So the "progress" in Tcl looks like this:
% puts [info patchlevel]--[binary encode hex [encoding convertto utf-8 [split "\ud83e\udd1d" {}]]]--[binary encode hex [encoding convertto utf-8 [string index "\ud83e\udd1d" 0]]]
- 8.6.10--eda0be20edb49d--eda0be
+ 8.6.11--f09fa49d--eda0be
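For reference, the byte patterns in the two outputs above can be reproduced by hand. The sketch below is not Tcl's own encoder: utf8Hex is a hypothetical helper that applies the generic UTF-8 bit layout to a single integer, and it deliberately encodes surrogate values as if they were ordinary code points (which strict UTF-8 forbids) to show where eda0be and edb49d come from. The 20 in the 8.6.10 output is just the space separating the two list elements returned by [split].

```tcl
# Hypothetical helper (not part of Tcl): encode one integer with the generic
# UTF-8 bit layout and return the bytes as a hex string.
# Assumption: surrogate values (0xD800..0xDFFF) are encoded like any other
# 3-byte value, which is what the 8.6.10 output looks like.
proc utf8Hex {code} {
    if {$code < 0x80} {
        set bytes [list $code]
    } elseif {$code < 0x800} {
        set bytes [list \
            [expr {0xC0 | ($code >> 6)}] \
            [expr {0x80 | ($code & 0x3F)}]]
    } elseif {$code < 0x10000} {
        set bytes [list \
            [expr {0xE0 | ($code >> 12)}] \
            [expr {0x80 | (($code >> 6) & 0x3F)}] \
            [expr {0x80 | ($code & 0x3F)}]]
    } else {
        set bytes [list \
            [expr {0xF0 | ($code >> 18)}] \
            [expr {0x80 | (($code >> 12) & 0x3F)}] \
            [expr {0x80 | (($code >> 6) & 0x3F)}] \
            [expr {0x80 | ($code & 0x3F)}]]
    }
    set hex ""
    foreach b $bytes {
        append hex [format %02x $b]
    }
    return $hex
}

# Two surrogates encoded separately (3 bytes each), as in the 8.6.10 line.
puts "[utf8Hex 0xD83E] [utf8Hex 0xDD1D]"
# The pair combined into the single code point U+1F91D (4 bytes), as in the 8.6.11 line.
puts [utf8Hex 0x1F91D]
```

The first puts shows each half of the pair as its own three-byte sequence; the second shows the four-byte sequence for the combined code point, which is the form [split] started returning in 8.6.11 while [string index] still returns one half.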
Anyway, regardless of how one would implement the conversion between the unicode and utf-8 representations of surrogate pairs, the different behavior across different functions and Tcl commands is definitely a bug, just because it is not consistent anymore. I know where it is coming from and why, or rather what exactly it was implemented this way to cover (e.g. do you still remember the "planting bombs" regression found in [bd94500678]?), but that is no excuse for an inconsistency like this. After all, this was just a feature and not a security leak or the like, and as a feature it is expected to retain backwards compatibility and, of course, consistency. As a bottom line regarding the whole handling (no matter which of the two representations is retained in the end), I don't think this different behavior between functions can be considered a feature. Sorry, but it is definitely a bug, Jan.
jan.nijtmans added on 2021-02-12 07:25:25:
This change was made on purpose: when a high surrogate is followed by a low surrogate, those two form a single character (like an emoji). So, they are not supposed to be split, ever. But in Tcl 8.6, the internal representation of strings is UTF-16, so such characters still have a length of 2 (that will be fixed in Tcl 9.0). Just curious what problem this causes in your application. Since 8.7 has even more code to make sure that emoji cannot be split, your application will break anyway when making the switch to 8.7. Most likely you should have used the "unicode" encoder to convert the string to UTF-16 characters, and then done more processing from there, instead of using "split", but without looking at your application it's hard to say. Thanks!
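The surrogate arithmetic that comes up in the comments above (kyak's commented-out format line) can be spelled out step by step. This is a minimal sketch, not code from the ticket or from tcl-telegram: the proc names toSurrogates, fromSurrogates and escapeCode are hypothetical, and it uses only integer arithmetic and [format], so it should not depend on how a given Tcl version treats surrogates in [split] or [string length].

```tcl
# Hypothetical helpers for illustration only.

# Split a supplementary code point (0x10000..0x10FFFF) into its
# UTF-16 high and low surrogate values.
proc toSurrogates {code} {
    if {$code < 0x10000 || $code > 0x10FFFF} {
        error "not a supplementary code point: $code"
    }
    set v [expr {$code - 0x10000}]              ;# 20-bit offset
    set hi [expr {($v >> 10) + 0xD800}]         ;# top 10 bits
    set lo [expr {($v & 0x3FF) + 0xDC00}]       ;# bottom 10 bits
    return [list $hi $lo]
}

# Inverse operation: combine a high/low surrogate pair into one code point.
proc fromSurrogates {hi lo} {
    return [expr {(($hi - 0xD800) << 10) + ($lo - 0xDC00) + 0x10000}]
}

# Produce the escape form kyak asked for: \uXXXX\uXXXX above the BMP,
# \uXXXX for other non-ASCII codes, and the plain character for ASCII.
proc escapeCode {code} {
    if {$code > 0xFFFF} {
        lassign [toSurrogates $code] hi lo
        return [format "\\u%04x\\u%04x" $hi $lo]
    } elseif {$code > 127} {
        return [format "\\u%04x" $code]
    }
    return [format %c $code]
}

puts [escapeCode 0x1F91D]
puts [fromSurrogates 0xD83E 0xDD1D]
```

Working on integer codes keeps the conversion identical whether the caller obtained the code from [split] plus [scan] (one element in 8.6.11) or from two [string index] calls combined with fromSurrogates.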
