Author: Jan Nijtmans <[email protected]>
State: Final
Type: Project
Vote: Done
Post-History:
Keywords: Tcl Tcl_UniCharToUtf()
Tcl-Version: 9.0
Tcl-Branch: tip-619
Vote-Summary: Accepted 4/0/1
Votes-For: AK, JN, KW, SL
Votes-Against: none
Votes-Present: FV
Abstract
This TIP proposes a new flag TCL_COMBINE for the function Tcl_UniCharToUtf(). With this flag, Tcl_UniCharToUtf() will try to combine surrogates (code range U+D800 - U+DFFF); without it, surrogates are handled as if they were valid code points.
This TIP is designed as a solution for this ticket: "Tcl interprets two adjacent surrogate code points as a character encoded using UTF-16". Currently, the internal use of UTF-16 is visible in some strange behavior at script level: whenever a high surrogate is immediately followed by a low surrogate (each of which would be illegal on its own), the two form a valid Unicode character. This TIP gets rid of that strange behavior in Tcl 9.0. Every place in the code that gives surrogate pairs this special treatment will stop doing so: surrogates remain illegal characters, although Tcl 9.0 can still handle them internally. They will no longer be combined and magically become valid.
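To make concrete what "combining" means here: a surrogate pair carries 20 bits of payload, 10 bits in each half, offset by 0x10000. The C sketch below shows that standard UTF-16 arithmetic. The function name combine_surrogates is hypothetical and not part of the Tcl API; it illustrates the calculation only, not how Tcl implements it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Tcl API: combine a UTF-16 surrogate
 * pair into the single code point it represents. */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);  /* high (leading) surrogate */
    assert(low >= 0xDC00 && low <= 0xDFFF);    /* low (trailing) surrogate */

    /* 10 payload bits from each half, then add the 0x10000 offset. */
    return 0x10000u
        + (((uint32_t)(high - 0xD800) << 10) | (uint32_t)(low - 0xDC00));
}
```

For the pair used in the examples of this TIP, combine_surrogates(0xD83D, 0xDE02) yields 0x1F602, the emoji U+1F602.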
Rationale
In Tcl 9.0 currently:
    $ tclsh9.0
    % string length \uD83D\uDE02
    1
But
    $ tclsh9.0
    % string length \uD83D\uD83D
    2
Whenever surrogate pairs are encountered in a Tcl script, the surrogates are automatically combined into a single Unicode character. This cannot be changed in Tcl 8.7 for three reasons:
- Compatibility with Tcl 8.6. In Tcl 8.6, the \U?????? representation doesn't work yet, so there is no other way to encode Emoji than by a surrogate pair \u????\u????.
- The "string" objType has an internal UTF-16 representation, which cannot distinguish between a surrogate pair and a Unicode character > U+FFFF. In Tcl 9.0, the "string" objType uses UTF-32.
- Since single surrogate characters cannot be represented in UTF-8, this opens the possibility of violations of the UTF-8 standard. But TIP #601, which can detect those violations, is accepted now.
Those 3 restrictions are gone in Tcl 9.0.
This TIP proposes that the function Tcl_UniCharToUtf() no longer combines surrogates by default, as it did in Tcl 8.x. Using the TCL_COMBINE flag makes Tcl_UniCharToUtf() behave as in Tcl 8.x again.
Since surrogates are illegal in UTF-8, the effect of this TIP will be:
    $ tclsh9.0
    % string length \uD83D\uDE02
    2
    % puts \uD83D\uDE02
    error writing "stdout": illegal byte sequence
    % encoding convertto utf-8 \uD83D\uDE02
    unexpected character at index 0: 'U+00D83D'
If you really want to output invalid UTF-8, you can use the cesu-8 encoding:
    % encoding convertto cesu-8 \uD83D\uDE02
    í ½í¸
    % fconfigure stdout -encoding cesu-8
    % puts \uD83D\uDE02
    😂
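The cesu-8 byte string can be reproduced by hand: CESU-8 encodes each surrogate separately with the ordinary 3-byte UTF-8 bit pattern, giving 6 bytes for the pair. The following is a minimal C sketch of that encoding step. The helper encode_3byte is hypothetical and not part of Tcl.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Tcl API: write a 16-bit value in the
 * range 0x0800..0xFFFF using the 3-byte bit pattern
 * 1110xxxx 10xxxxxx 10xxxxxx. Applied to a surrogate, this produces the
 * bytes CESU-8 uses, which strict UTF-8 forbids. */
void encode_3byte(uint16_t unit, unsigned char out[3])
{
    assert(unit >= 0x0800);
    out[0] = (unsigned char)(0xE0 | (unit >> 12));          /* top 4 bits    */
    out[1] = (unsigned char)(0x80 | ((unit >> 6) & 0x3F));  /* middle 6 bits */
    out[2] = (unsigned char)(0x80 | (unit & 0x3F));         /* low 6 bits    */
}
```

Encoding 0xD83D gives the bytes ED A0 BD, and 0xDE02 gives ED B8 82. Displayed as Latin-1 characters, those six bytes are what the encoding convertto cesu-8 call prints.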
The -nocomplain option won't give the desired effect in this case:
    % encoding convertto -nocomplain utf-8 \uD83D\uDE02
    ��
Note that � is the UTF-8-encoded form of the replacement character U+FFFD.
Extensions using Tcl_UniCharToUtf() in Tcl 8.7 always encounter the following behaviour: calling Tcl_UniCharToUtf() with a high surrogate, followed by another Tcl_UniCharToUtf() call with a low surrogate, outputs a 4-byte UTF-8 sequence. The first call outputs a single byte, the second call outputs the remaining 3 bytes. So, combining the two surrogates is handled internally, and it cannot be switched off.
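As an illustration of those byte counts, the sketch below encodes a code point above U+FFFF as 4 UTF-8 bytes. The helper encode_4byte is hypothetical and is not Tcl's implementation; it only shows which bytes make up the sequence that the two Tcl 8.7 calls produce between them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Tcl API: encode a code point in the
 * range 0x10000..0x10FFFF with the 4-byte UTF-8 bit pattern
 * 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. */
void encode_4byte(uint32_t cp, unsigned char out[4])
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    out[0] = (unsigned char)(0xF0 | (cp >> 18));           /* lead byte    */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));  /* trail byte 1 */
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));   /* trail byte 2 */
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));          /* trail byte 3 */
}
```

For U+1F602 this gives F0 9F 98 82. Going by the description of the Tcl 8.7 behaviour, the first call would contribute the lead byte and the second call the three trail bytes.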
In Tcl 9.0, with this TIP, the combining of surrogates is no longer handled automatically, only when the TCL_COMBINE flag is used.
In Tcl 8.7, the TCL_COMBINE flag will be defined too, but it simply has the value 0. This is meant to help code written for Tcl 9.0, so that it compiles and runs unchanged in Tcl 8.7.
Implementation
Available in the tip-619 branch.
This branch targets 9.0.
Copyright
This document has been placed in the public domain.