Ticket UUID: debd088e48998a758f2927b67c132341954428ea
Title: llength split is different from string length
Type: Bug
Version: 8.6.11
Submitter: kyak
Created on: 2021-02-11 18:48:35
Subsystem: 44. UTF-8 Strings
Assigned To: jan.nijtmans
Priority: 5 Medium
Severity: Minor
Status: Closed
Last Modified: 2024-08-20 07:18:53
Resolution: Fixed
Closed By: pooryorick
Closed on: 2024-08-20 07:18:53
Description:
Starting from Tcl 8.6.11, for unicode strings, [llength [split]] is different from [string length]. Here is an example:

$ echo 'puts "[info patchlevel] -- [llength [split "\ud83e\udd1d" {}]] -- [string length "\ud83e\udd1d"]"' | tclsh
8.6.10 -- 2 -- 2
$ echo 'puts "[info patchlevel] -- [llength [split "\ud83e\udd1d" {}]] -- [string length "\ud83e\udd1d"]"' | tclsh
8.6.11 -- 1 -- 2
User Comments:
pooryorick added on 2024-08-20 07:18:53:
This was fixed for Tcl 9.0. See [0d61d3a2bb905178] for details.

jan.nijtmans added on 2021-02-25 11:59:39:
> do we yet have any kind of guide to the long-term Tcl programmer how their code should evolve.
No we don't yet. Agreed that such a document should be written. I'm willing to start on that.

dgp added on 2021-02-24 18:14:57:
Speaking of migration pain, do we yet have any kind of guide to the long-term Tcl programmer how their code should evolve (scripts and extensions) to accommodate the shifting definition of the string value approved in TIP 389 and subsequent TIPs? Maybe I'll learn to love the disruption if I can see the deployment plan. If we don't have such a document yet, it is long overdue.

dgp added on 2021-02-24 18:09:07:
"Applications using Emoji.... should know better..." This presumes an app that has control over the data it receives and processes. What's happened in Tcl 8.6.11 is that there are now values that can get into Tcl that break its operations according to long-standing expectations. There are invariants that Tcl coders have been able to assume for a very long time, that are suddenly variant now, demanding new coding strategies. This imposes a new defensive duty on every app to screen its inputs looking for these toxic values to keep them out, or to recode any procedures that relied on the invariants to tolerate the new. This is a significant burden at any point in Tcl development, imposing some migration pain in the approved 8.6 -> 8.7 -> 9.0 migration, but it is really out of place that it leaked back into the 8.6.11 patch release. Jan, I understand you disagree, and that you won't fix it, but I think the search for a better way should continue.

jan.nijtmans added on 2021-02-22 08:57:35:
Since kyak's "tcl-telegram" application is already fixed, and splitting of emoji is (still) a bad idea, the behavior now is as it always should have been.
It's unfortunate this change occurred in a patch release (sorry!), but given this situation it's best to keep it this way and go forward. So, "Won't fix". Applications using Emoji in Tcl 8.6 (which never officially supported that) should know better or handle it correctly .... See my comment 2021-02-15 17:53:03 below for how this "correct handling" should be done.

jan.nijtmans added on 2021-02-15 20:43:40:
Yes, it's unfortunate. Using string index will continue to work in Tcl 8.7, but it will break in Tcl 9.0. So if you want your code to work with Tcl 9.0, you will _have_ to change it one day. It's up to you.

kyak added on 2021-02-15 18:56:08:
@jan.nijtmans thanks for your suggestion! I need those unicode surrogate pairs encoded as a pair of \uXXXX\uXXXX. So this won't work:

    append result [format "\\U%06.6x" $code]

I guess this might work:

    #append result [format "\\u%04.4x\\u%04.4x" [expr {(($code-0x10000)>>10)+0xD800}] [expr {(($code-0x10000)&0x3FF)+0xDC00}]]

But I have trouble reading this code :) Better stick to string index, much more readable. Also, the if {$code > 0xFFFF} branch is basically only there for Tcl 8.6.11 and higher, because older Tcl versions won't even hit it. Almost like a #if GCC_VERSION, but for Tcl :)

jan.nijtmans added on 2021-02-15 17:53:03:
@kyak: How about still using "split" but being able to handle codes > 0xFFFF:

    foreach char [split $str {}] {
        scan $char %c code
        if {$code > 0xFFFF} {
            append result [format "\\U%06.6x" $code]
            #append result [format "\\u%04.4x\\u%04.4x" [expr {(($code-0x10000)>>10)+0xD800}] [expr {(($code-0x10000)&0x3FF)+0xDC00}]]
        } elseif {$code > 127} {
            append result [format "\\u%04.4x" $code]
        } else {
            append result $char
        }
    }

This way, you can handle codes > 0xFFFF separately, any way you like, either using the \U?????? syntax, or split it in surrogates. This will continue to work in Tcl 8.7 and Tcl 9.0 as well!
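The surrogate arithmetic in the commented-out `format` line can be unpacked into named steps, which may make it easier to read. This is a minimal sketch, not code from the ticket or from tcl-telegram: the proc name `surrogateEscape` is hypothetical, and it assumes the input is an integer code point above 0xFFFF, such as the value `scan $char %c code` yields for an unsplit emoji.

```tcl
# Hypothetical helper (the name is illustrative, not from the ticket):
# turn a code point above 0xFFFF into a \uXXXX\uXXXX surrogate-pair escape.
proc surrogateEscape {code} {
    if {$code < 0x10000 || $code > 0x10FFFF} {
        error "code point outside the supplementary range: $code"
    }
    # Offset into the supplementary planes (20 significant bits).
    set offset [expr {$code - 0x10000}]
    # The top 10 bits select the high surrogate, the bottom 10 the low one.
    set high [expr {0xD800 + ($offset >> 10)}]
    set low  [expr {0xDC00 + ($offset & 0x3FF)}]
    # Braces keep the backslashes literal; format only interprets the % fields.
    return [format {\u%04x\u%04x} $high $low]
}

puts [surrogateEscape 0x1F91D]   ;# prints \ud83e\udd1d
```

Because it only does integer arithmetic, the result does not depend on how a given Tcl version splits or indexes strings; it could stand in for the `\U` line inside the `$code > 0xFFFF` branch when `\uXXXX\uXXXX` output is required.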

sebres added on 2021-02-13 13:52:40:
The thing is, for this concrete case one could surely use something else (string map, regexp, some binary encoder, whatever), but as already said, the issue is about compatibility and consistency. Binary packages using the utf routines are also affected by the regression. Don confirmed in the tclchat that iterating and indexing should agree. I think what matters is that they agree no matter which method one uses, regardless of whether it is a surrogate pair or not, which char representation is considered right, and whether it is handled as one or two characters.
If I were the TCT, to support emojis etc. I would make an absolutely new class of functions, for example Tcl_Glyph* (and a Tcl ensemble, e.g. [glyph]), and retain the Tcl_Utf*/Tcl_Uni* (as well as [string] etc.) stuff unmodified... or, if in Tcl_Uni*, then somehow optional (configurable per interp/build), so it would be compatible and expected, as well as usable where one needs it (e.g. representation on display, navigation, or input in some edit box). And all that no matter whether 8.6, 8.7 or 9.0 (not to mention that the current implementation in 8.6 resembles rather a dirty workaround).
As for compatibility, [22324bcbdf] (and I assume more tickets will surely follow), as well as the fix suggested by Christian, show what the issue is and how much effort is needed to fix all that. Consequently Tcl would distance itself from utf-8 and move towards unicode. Let alone a possible future shimmering of several object types to string if the handling would prefer string object methods (if a conversion like TclUtfToUniChar is not enough for the handling).

kyak added on 2021-02-13 05:53:12:
Hi Jan, You can have a look at my application and the exact commit where I had to make a fix after upgrading to 8.6.11 here: https://gitlab.com/kyak/tcl-telegram/-/commit/bf5614cb429493a6d66a3a553f1821a116142c9b Basically, I am using "foreach split" to iterate over string elements and convert symbols with code > 127 to \uXXXX sequences.

jan.nijtmans added on 2021-02-12 15:06:20:
Thanks, @sebres, for sharing your view. Let's wait on @kyak for providing more information, how this affects him.

sebres added on 2021-02-12 14:05:22:
So you mean that a situation where Tcl routines following the unicode string representation behave totally differently from routines following the utf-8 string representation is normal? For example, using [string index]/[string length] it is possible to address every char of a surrogate pair, but [split] etc. (so basically everything using Tcl_UtfNext, Tcl_UtfPrev, Tcl_UtfToUniChar etc.) misses that (deviates in behavior when navigating or iterating over the string). Let alone that such an issue in a minor version upgrade is totally unexpected and can well be considered a regression, handling like this (different behavior depending on which function/command is used) is nonsense, Jan. No language in the world behaves like this or has a weakness like this. Moreover, if you take a look at python (basically not such a good example, because the python 2.x -> 3.x changes were a disaster as regards backwards compatibility), just to note: they made a completely different switch.
So the "progress" in Tcl looks so:
% puts [info patchlevel]--[binary encode hex [encoding convertto utf-8 [split "\ud83e\udd1d" {}]]]--[binary encode hex [encoding convertto utf-8 [string index "\ud83e\udd1d" 0]]]
- 8.6.10--eda0be20edb49d--eda0be
+ 8.6.11--f09fa49d--eda0be
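The two hex dumps differ because 8.6.10 encodes each surrogate as its own three-byte sequence, while 8.6.11 combines the pair into code point U+1F91D and emits one four-byte sequence. The sketch below reproduces both byte patterns with plain integer arithmetic. It is only an illustration: `utf8Hex` is a hypothetical helper, not a Tcl built-in and not how Tcl's encoder is implemented, and it deliberately applies the generic UTF-8 bit layout even to surrogate values.

```tcl
# Illustrative helper (not a Tcl built-in): hex of the generic UTF-8 bit
# layout for one integer value, applied even to surrogate values.
proc utf8Hex {code} {
    if {$code < 0x80} {
        return [format %02x $code]
    } elseif {$code < 0x800} {
        return [format %02x%02x \
            [expr {0xC0 | ($code >> 6)}] \
            [expr {0x80 | ($code & 0x3F)}]]
    } elseif {$code < 0x10000} {
        return [format %02x%02x%02x \
            [expr {0xE0 | ($code >> 12)}] \
            [expr {0x80 | (($code >> 6) & 0x3F)}] \
            [expr {0x80 | ($code & 0x3F)}]]
    } else {
        return [format %02x%02x%02x%02x \
            [expr {0xF0 | ($code >> 18)}] \
            [expr {0x80 | (($code >> 12) & 0x3F)}] \
            [expr {0x80 | (($code >> 6) & 0x3F)}] \
            [expr {0x80 | ($code & 0x3F)}]]
    }
}

# Each surrogate encoded on its own: three bytes apiece.
puts "[utf8Hex 0xD83E] [utf8Hex 0xDD1D]"   ;# prints eda0be edb49d
# The pair combined into code point U+1F91D: one four-byte sequence.
puts [utf8Hex 0x1F91D]                     ;# prints f09fa49d
```

The extra `20` in the 8.6.10 dump is the space separating the two list elements returned by [split].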
Anyway, regardless of how one would implement the conversion between the unicode and utf-8 representations of surrogate pairs, the different behavior across different functions and Tcl commands is definitely a bug, simply because it is not consistent anymore. I know where it is coming from and why, or rather what exactly it was implemented this way to cover (e.g. do you still know the "planting bombs" regression found in [bd94500678]?), but that is not an excuse for an inconsistency like this. This was ultimately just a feature and not a security leak or so, and as a feature it is expected to retain backwards compatibility and of course consistency. As a bottom line regarding the whole handling (no matter which of the two representations is retained in the end), I don't think this different behavior between functions can be considered a feature. Sorry, but it is definitely a bug, Jan.

jan.nijtmans added on 2021-02-12 07:25:25:
This change is made on purpose: When a high surrogate is followed by a low surrogate, those two form a single character (like an emoji). So, they are not supposed to be split, ever. But in Tcl 8.6, the internal representation of strings is in UTF-16, so such characters still have a length 2 (that will be fixed in Tcl 9.0). Just curious what problem this causes in your application. Since 8.7 has even more code to make sure that emoji cannot be split, your application will break anyway when making the switch to 8.7. Most likely you should have used the "unicode" encoder to convert the string to UTF-16 characters, and then do more processing from there, instead of using "split", but without looking at your application it's hard to say. Thanks!