Title: Tcl interprets two adjacent surrogate code points as a character encoded using UTF-16
Submitter: pooryorick
Created on: 2021-03-06 18:26:33
Subsystem: 18. Commands M-Z
Assigned To: jan.nijtmans
Status: Closed
Last Modified: 2022-07-13 14:49:12
Closed on: 2022-07-13 14:49:12
This report was inspired by [debd088e48998a75].
Various Unicode code points do not correspond to any abstract character. Such code points include private-use code points, surrogate code points, noncharacters, specials, tag characters, and unassigned code points. Although none of these code points correspond to any abstract character, the Unicode standard takes care not to forbid these code points in a Unicode string. Rather than saying that such characters and non-characters can not occur in a text, it simply labels some of them as "restricted interchange". Elsewhere, the standard explicitly permits their use for internal purposes.
A surrogate pair is a sequence of two code points used in utf-16 to represent a character beyond those that could be represented in ucs-2. When a surrogate pair is encountered during the decoding of text encoded in utf-16, it is translated into another code point.
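The translation a utf-16 decoder performs on a surrogate pair is plain arithmetic. As a minimal sketch (the proc name `combineSurrogates` is hypothetical and not part of Tcl), using the pair \uD83E\uDD1D that comes up later in this ticket:

```tcl
# Hypothetical helper: combine a high and a low surrogate the way a
# UTF-16 decoder would. Arguments are integer code point values.
proc combineSurrogates {hi lo} {
    if {$hi < 0xD800 || $hi > 0xDBFF || $lo < 0xDC00 || $lo > 0xDFFF} {
        error "not a high/low surrogate pair"
    }
    # Each surrogate carries 10 bits of the value (code point - 0x10000).
    expr {0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00)}
}

puts [format U+%X [combineSurrogates 0xD83E 0xDD1D]]   ;# prints U+1F91D
```

This only works on integers, so it does not depend on how any particular Tcl release stores strings.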
These surrogate code points were added to Unicode to ensure no overlap between their use in utf-16 and other **external** uses.
According to the Unicode standard,
"Surrogate code points cannot be conformantly interchanged using Unicode encoding forms. They do not correspond to Unicode scalar values and thus do not have well-formed representations in any Unicode encoding form."
The Unicode standard also states that
"A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character."
What it does not say is that a surrogate code point can not be used internally for some arbitrary purpose. We also know that the "abstract character" prohibition doesn't imply a prohibition on internal use because the Unicode standard explicitly says this about noncharacter code points:
"The noncharacter code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly."
What can be derived from all this is that if a code point does not map to an abstract character, in the absence of any other interpretation it should be treated merely as a code point, but not disallowed.
Tcl correctly handles the case of a single surrogate code point by simply including the code point in the string as requested by the script author:
But when two surrogate code points occur successively Tcl does something bizarre: It translates the two code points into another code point as if it was a utf-16 decoder.
This is a big mistake.
Tcl is not a utf-16 decoder. For transcoding utf-16 Tcl provides
Tcl is also not a Unicode application. Rather, it is a tool for creating Unicode applications. In fact, it is even further removed than that: It is a tool for creating tools that create Unicode applications. As such, Tcl must be **Unicode-agnostic**, not **Unicode-conformant**. As an all-purpose tool for creating Unicode-conformant applications, Tcl must provide the capability to create arbitrary sequences of Unicode code points so that application authors can create systems that meet their needs. Restricting scripts to working only with Unicode characters rather than Unicode code points deprives those scripts of significant utility.
What this means, for example, is that rather than performing normalization, Tcl should provide routines that perform whatever normalization a particular application requires. In the case at hand, though, what it means is that Tcl **should not** try to conform to the Unicode standard by becoming a quasi utf-16 transcoder. The application (script) author should be free to craft whatever Unicode strings they see fit to craft for internal use. If the author wants the HANDSHAKE character, there is an easy way to do that:
Likewise, if the author wishes to create a sequence of two code points which both happen to be surrogates, there should be an easy way to do that:
These two code points should subsequently continue to be treated as two code points. This facilitates the creation of any possible sequence of code points.
The script level is not the place to expose details of utf-16 encoding. Rather, it should remain a pure and flexible Unicode environment where it is possible to craft any sequence of code points. It is the responsibility of the application author to ultimately produce conformant text. If a script attempts to convert to utf-16 a sequence of code points that can't be represented in utf-16, Tcl should then produce an error, as this is the boundary between internal and external use.
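To illustrate the boundary described above, here is a rough sketch of an encoder that accepts any list of integer code points and only complains at the moment of conversion. The proc name `utf16Units` is hypothetical; this is not how Tcl's own `encoding` machinery is implemented, just the behaviour being argued for:

```tcl
# Hypothetical sketch: turn a list of code points into UTF-16 code units,
# raising an error for anything UTF-16 cannot represent.
proc utf16Units {codePoints} {
    set units {}
    foreach cp $codePoints {
        if {$cp < 0 || $cp > 0x10FFFF} {
            error "not a Unicode code point: $cp"
        }
        if {$cp >= 0xD800 && $cp <= 0xDFFF} {
            # A surrogate code point has no well-formed UTF-16 representation.
            error [format "cannot encode surrogate code point U+%04X" $cp]
        }
        if {$cp < 0x10000} {
            lappend units [expr {$cp}]
        } else {
            set v [expr {$cp - 0x10000}]
            lappend units [expr {0xD800 + ($v >> 10)}] [expr {0xDC00 + ($v & 0x3FF)}]
        }
    }
    return $units
}

puts [utf16Units {0x1F91D}]                         ;# prints 55358 56605
puts [catch {utf16Units {0xD83E 0xDD1D}} message]   ;# prints 1
```

The internal sequence of two surrogate code points stays legal up to the call; only the conversion rejects it.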
Recent tickets hint at the troubles this new behaviour will cause. One example
is [22324bcbdf]. The incorrect assumption in this ticket is that
In short, the only time Tcl should do anything special with surrogate code
points is when it is converting to/from utf-16. Treating the
jan.nijtmans added on 2022-07-13 14:49:12:
Since TIP #619 is now Accepted and merged into 9.0, we can close this ticket.
Fixed in Tcl 9.0
pooryorick added on 2022-04-19 21:16:16:
TIP 619 looks good! Many thanks for your continued work on this. Clearly, getting Tcl ready to move beyond the BMP has been a monumental effort. The future looks good!
jan.nijtmans added on 2022-04-18 17:26:50:
A proposed fix for this is put together now in TIP #619
jan.nijtmans added on 2021-10-31 11:13:15:
> Is there a plan to fix this?
In Tcl 8.7, due to the dual-value implementation of strings (UTF-8 and UTF-16), this cannot be fixed: it cannot be prevented that the internal UTF-16 implementation leaks through at script level. In Tcl 9.0, there's still room for changing this, since the internal implementation changed from UTF-16 to UTF-32. So, feel free to start a new branch based on 9.0 (and - maybe - a new TIP): this way it will become more clear what the implications of those changes are.
pooryorick added on 2021-10-28 11:44:52:
Tip 573 was withdrawn but the following script currently still evaluates to
Is there a plan to fix this?
jan.nijtmans added on 2021-03-12 15:52:58:
Since this ticket is already being hijacked anyway 😉 ...
First version of Migration guide is available now.
Don, I would like to invite you to be co-author. We can add more here as the Tcl 9 design becomes more clear.
pooryorick added on 2021-03-11 08:32:10:
I sincerely hope that rather than hacking makeshift solutions together and propagating them forward in the name of backwards compatibility, Tcl takes the time to get the design right and implement a high-quality Unicode environment. I'm eager to help with the gruntwork wherever time affords.
chw added on 2021-03-11 00:00:24:
Or to be explicit: in order to avoid ambiguities there must be a Tcl_UniChar data type able to express at least 24 bits. Otherwise we are doomed to define TIPs over functions over procedures over conventions over API variations over defines and so on, and never find a proper single solution.
chw added on 2021-03-10 22:52:22:
Indeed, we are now back to square one: as long as Tcl_UniChar is a 16-bit entity, we will suffer from the complexities laid out some time ago in this little unspectacular piece: https://wiki.tcl-lang.org/page/Why+AndroWish+switched+to+TCL%5FUTF%5FMAX%3D6 As long as 16 bits are set, there is no way out of that blues.
pooryorick added on 2021-03-10 21:57:58:
I just read the proposal again, and realized it contains this: "One exception to this has to be made. When using the escape sequence \uD800\uDC00, so a valid combination of an upper and a lower surrogate, in a script, there is no harm in translating that to the intended Unicode code point. In Tcl 8.6 and 8.7 there is no other way than that for specifying Emoji. Allowing this provides an upgrade path for existing scripts handling Emoji. Starting with Tcl 8.7, the "\UXXXXXX" escape sequence should be used for this." Such an exception ruins everything and brings us back to square one, where it isn't possible to compose a string of an arbitrary sequence of code points. This is too high a price to pay for the ability to enter smiley faces.
pooryorick added on 2021-03-10 21:40:48:
TIP 597 is more like it! Considering a Unicode string to be a sequence of code
points rather than a sequence of characters, and allowing a string to be
composed of any sequence of code points at all is the perfect solution to this
issue. Tcl can leave it to the various transcoders to decide which code point
sequences are allowed in the target encoding. But since any sequence of
Unicode code points is in fact Unicode,
In my opinion, rather than silently replacing code points it can't work with, an encoder should produce an error if it is handed a string it can't properly encode. The current practice of replacing invalid characters has caused all sorts of trouble, and the sooner it is eliminated, the better. Replacement can be implemented at the script level if needed, but if it's baked in, there's no escaping it.
Thanks, Jan, for revisiting the design and finding a better approach, and thanks, Don, for taking time to articulate the alternatives.
dgp added on 2021-03-10 18:01:11:
The Unicode Glossary https://unicode.org/glossary/ defines a Unicode Scalar Value. It might do us good to use that term when that's what we mean for greater clarity and conformity to language set by others. "[string is unicode $s] returns 1 when all codepoints in $s are Unicode Scalar Values, 0 otherwise." as a possible example.
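As a rough script-level sketch of the test suggested above: a Unicode Scalar Value is any code point from 0 to 0x10FFFF outside the surrogate range 0xD800 to 0xDFFF. The proc names `isScalarValue` and `allScalarValues` are hypothetical, and `[string is unicode]` is only a proposed name here, not an existing subcommand:

```tcl
# Hypothetical helper: 1 if the integer cp is a Unicode Scalar Value.
proc isScalarValue {cp} {
    expr {$cp >= 0 && $cp <= 0x10FFFF && ($cp < 0xD800 || $cp > 0xDFFF)}
}

# Hypothetical helper: 1 if every element of the string is a scalar value.
# Assumption: [split $s ""] yields one element per code point; how Tcl 8.x
# splits strings containing code points beyond U+FFFF depends on the build.
proc allScalarValues {s} {
    foreach ch [split $s ""] {
        scan $ch %c cp
        if {![isScalarValue $cp]} { return 0 }
    }
    return 1
}

puts [isScalarValue 0x1F91D]   ;# prints 1
puts [isScalarValue 0xD83E]    ;# prints 0
```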
jan.nijtmans added on 2021-03-10 13:28:32:
Thanks, Don, for this clear explanation.
It inspired me to write TIP #597, as a possible way out: No need to reduce the Tcl value set, but still the guarantee that produced "utf-8" conforms to the Unicode standard.
I'm closing this ticket now.
dgp added on 2021-03-09 17:00:05:
This is a dispute over the alphabet of Tcl's value set. Jan is correct to point to TIP 573 as the key issue.
All Tcl values have always been (conceptually) "strings", but the precise definition of what is a "string" has varied with time/release. The last large revolution in Tcl's string value definition was the 8.0 -> 8.1 transition. In 8.0 a "string" was a sequence of zero or more bytes. In 8.1, a "string" became a sequence of zero or more UCS-2 values, which our documentation and other writings call "characters".
The migration plan for Tcl 9 that Jan has developed appears to aim for a Tcl 9 definition of "string" as "a proper Unicode string". This implies the exclusion of values that are not valid Unicode. The Unicode alphabet does not include the surrogate codepoints. They cannot be included in any valid Unicode string, so on this plan they must no longer be in the Tcl value alphabet. Consequently all the existing tools that could create or manipulate these values in earlier releases have to incompatibly change, and there's a disruption for developers that has to be managed.
One factor in favor of a goal to have the Tcl 9 value set be proper Unicode strings is that it moves Tcl more in the direction of using standard components and tools and names and away from its frequent (bad?) habit of inventing its own weird alternatives. That has the potential to make Tcl's code base marginally more attractive to experienced coders.
I think there are two categories of reasonable criticism.
First, for the first time the Tcl value set is getting reduced. There are values that were once valid in Tcl that no longer will be. This reduces the set of programs that can be written, or easily written. Manipulation of arbitrary UCS-2 sequences in a Tcl script becomes something unnatural in a language that refuses to directly represent those values. This means writing a Tcl script that does UTF-16 encoding/decoding is something that has no implementation that rests easily on the capabilities of the language. It can still be accomplished with suitable use of [binary format|scan t*] (and the inefficiencies inherent in that), but that's not the most satisfying answer to many programmers. It's a task where the language has to be worked around rather than used. From this perspective, the goal itself is flawed, and we should change course and aim to have the Tcl 9 definition of Tcl strings somehow continue to be a superset of the set of all UCS-2 sequences, and we should retain all the tools we have to operate on such values.
The second category of reasonable criticism is that the Tcl 9 goal to define proper Unicode strings as Tcl's value set is acceptable, but that the current migration path to get there is bumpier than it ought to be. In that realm, the more specific the complaint, and the offering of alternatives is the best way forward. Also, in this realm, the most useful venue for hashing out the issues would be the production of migration guides for Tcl programmers. Forcing ourselves into the teaching role should force us to uncover just where the rough edges are, and whether they can be sanded down, or whether any are deal-breakers.
I tend to agree with the first criticism. From pooryorick's submission I think he does too. Jan has done the work, though. He's held the TIP votes and got the requisite approvals. Tcl 8.7 behaves the way it does (mostly, modulo some errors along the way) because the TCT accepted the changes. Perhaps some of the TCT supporters of this path have not been fully aware of these issues of disruption and representation, and it might be worthwhile to raise discussions on TCLCORE, but I think that's the better venue for confronting the dispute at this point than a ticket asserting a bug.
For myself, I've taken the attitude for some time now that no matter my reservations and distaste, if I cannot deliver a better alternative, there's little value in being no more than a naysayer. Pooryorick, if you want to pursue another approach, I'm willing to help where and as I can. I hope that's fair to everyone.
jan.nijtmans added on 2021-03-08 09:09:11:
Can you please change the example:
From this I'm assuming this bug report references 8.7, since 8.6 doesn't accept the \U?????-form yet for code points > U+FFFF.
Tcl indeed interprets such escape sequences as you describe. It's a feature, not a bug. Internally, Tcl stores strings in two possible forms: one is WTF-8, the other is UTF-16. Unicode allows use of such forms internally, as long as they are translated into correct UTF-8 when communicating with the outside world. Since 8.6.10, Tcl does this correctly when it can (when the string consists of valid surrogate pairs) but allows WTF-8 forms to escape when they contain unpaired surrogates.
I would like to deprecate such usage, see TIP #573, but I don't think this would be accepted for 8.7. Maybe for 9.0, but - still - this TIP text is premature, fully open for discussion.
For 8.x, yes, it's too late, since \uD83E\uDD1D is already a valid (internal) representation for 🤝. It's the only portable way to do that in Tcl 8.6. In Tcl you can simply use the 🤝 character directly (TIP #587), or you can use the \U1F91D escape sequence.
Can I close this ticket as "Fixed in 9.0"? If you find situations in which WTF-8 forms escape (like the two recent bugs Christian Werner reported), please file a bug report! I'm not claiming Tcl 8.6/8.7/9.0 is there yet (TIP #575 is being voted on now), but it's getting better and better.