TIP 619: New TCL_COMBINE flag for Tcl_UniCharToUtf()

Author:         Jan Nijtmans <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Post-History:   
Keywords:       Tcl Tcl_UniCharToUtf()
Tcl-Version:    9.0
Tcl-Branch:     tip-619
Vote-Summary:   Accepted 4/0/1
Votes-For:      AK, JN, KW, SL
Votes-Against:  none
Votes-Present:  FV

Abstract

This TIP proposes a new flag, TCL_COMBINE, for the function Tcl_UniCharToUtf(). With this flag, Tcl_UniCharToUtf() will try to combine surrogates (code range U+D800 - U+DFFF); without it, surrogates are handled as if they were valid code points.

This TIP is designed as a solution for this ticket: "Tcl interprets two adjacent surrogate code points as a character encoded using UTF-16". Currently, the internal use of UTF-16 is visible as some strange behavior at script level: whenever a high surrogate is glued together with a low surrogate (each of which would be illegal individually), they form a valid Unicode character. This TIP is written to get rid of that strange behavior in Tcl 9.0. None of the places in the code that currently implement this special behavior will do so any more: surrogate characters are illegal, but Tcl 9.0 can still handle them internally. They will no longer be combined and magically become valid.

Rationale

In Tcl 9.0 currently:

    $ tclsh9.0
    % string length \uD83D\uDE02
    1
But
    $ tclsh9.0
    % string length \uD83D\uD83D
    2
Whenever surrogate pairs are encountered in a Tcl script, the surrogates are automatically combined into a single Unicode character. This cannot be changed in Tcl 8.7 for three reasons:

Those 3 restrictions are gone in Tcl 9.0.
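
The different results for \uD83D\uDE02 and \uD83D\uD83D follow from the UTF-16 pairing rule: only a high surrogate (U+D800 - U+DBFF) followed by a low surrogate (U+DC00 - U+DFFF) forms a pair. The following is a minimal C sketch of that rule, not Tcl's implementation; the function names are made up for illustration and are not part of the Tcl API.

```c
#include <stdint.h>

/* Illustrative helpers only; these names do not exist in Tcl. */

/* High surrogates occupy U+D800..U+DBFF. */
int is_high_surrogate(uint32_t ch)
{
    return ch >= 0xD800 && ch <= 0xDBFF;
}

/* Low surrogates occupy U+DC00..U+DFFF. */
int is_low_surrogate(uint32_t ch)
{
    return ch >= 0xDC00 && ch <= 0xDFFF;
}

/*
 * Combine a (high, low) surrogate pair into one code point.
 * Returns 0 when the two values do not form a valid pair.
 */
uint32_t combine_surrogates(uint32_t hi, uint32_t lo)
{
    if (!is_high_surrogate(hi) || !is_low_surrogate(lo)) {
        return 0;
    }
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

With these definitions, combine_surrogates(0xD83D, 0xDE02) yields 0x1F602, a single character, while combine_surrogates(0xD83D, 0xD83D) yields 0 because both halves are high surrogates and therefore stay two separate code points.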

This TIP proposes that the function Tcl_UniCharToUtf() no longer combines surrogates by default, as it did in Tcl 8.x. Using the TCL_COMBINE flag makes Tcl_UniCharToUtf() behave as in Tcl 8.x again. Since surrogates are illegal in UTF-8, the effect of this TIP will be:

    $ tclsh9.0
    % string length \uD83D\uDE02
    2
    % puts \uD83D\uDE02
    error writing "stdout": illegal byte sequence
    % encoding convertto utf-8 \uD83D\uDE02
    unexpected character at index 0: 'U+00D83D'
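
The errors in this session come from a strict UTF-8 encoder refusing any code point in the surrogate range. As a rough illustration, and not Tcl's actual encoder, a strict encoder could look like the sketch below; the name utf8_encode_strict is an assumption made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical strict UTF-8 encoder (not a Tcl function).
 * Writes the encoding of ch to buf (at least 4 bytes) and returns
 * the number of bytes written, or 0 if ch cannot be encoded.
 */
size_t utf8_encode_strict(uint32_t ch, unsigned char *buf)
{
    if (ch >= 0xD800 && ch <= 0xDFFF) {
        return 0;                      /* surrogates are illegal in UTF-8 */
    }
    if (ch > 0x10FFFF) {
        return 0;                      /* beyond the Unicode range */
    }
    if (ch < 0x80) {
        buf[0] = (unsigned char)ch;
        return 1;
    }
    if (ch < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (ch >> 6));
        buf[1] = (unsigned char)(0x80 | (ch & 0x3F));
        return 2;
    }
    if (ch < 0x10000) {
        buf[0] = (unsigned char)(0xE0 | (ch >> 12));
        buf[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (ch & 0x3F));
        return 3;
    }
    buf[0] = (unsigned char)(0xF0 | (ch >> 18));
    buf[1] = (unsigned char)(0x80 | ((ch >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (ch & 0x3F));
    return 4;
}
```

In this sketch the lone surrogate U+D83D is rejected (return value 0), whereas the real character U+1F602 encodes to the four bytes F0 9F 98 82.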

If you really want to output invalid UTF-8, you can use the cesu-8 encoding:

    % encoding convertto cesu-8 \uD83D\uDE02
    😊
    % fconfigure stdout -encoding cesu-8
    % puts \uD83D\uDE02
    😂
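
CESU-8 differs from UTF-8 in that each surrogate of a pair is written as its own 3-byte sequence, which is why its output is not valid UTF-8. The sketch below illustrates that byte layout under this definition of CESU-8; it is not taken from Tcl's cesu-8 encoding, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a Tcl function): write a 16-bit value in the
 * range 0x0800..0xFFFF, surrogates included, as a 3-byte sequence.
 */
size_t encode_3byte(uint32_t ch, unsigned char *buf)
{
    buf[0] = (unsigned char)(0xE0 | ((ch >> 12) & 0x0F));
    buf[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
    buf[2] = (unsigned char)(0x80 | (ch & 0x3F));
    return 3;
}

/*
 * CESU-8 style encoding of a surrogate pair: 3 bytes for the high
 * surrogate followed by 3 bytes for the low surrogate, 6 bytes in total.
 * buf must have room for at least 6 bytes.
 */
size_t cesu8_encode_pair(uint32_t hi, uint32_t lo, unsigned char *buf)
{
    size_t n = encode_3byte(hi, buf);
    n += encode_3byte(lo, buf + n);
    return n;
}
```

For the pair U+D83D U+DE02 this produces the six bytes ED A0 BD ED B8 82, compared with the four bytes F0 9F 98 82 that UTF-8 uses for the combined character U+1F602.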

The -nocomplain option won't give the desired effect in this case:

    % encoding convertto -nocomplain utf-8 \uD83D\uDE02
    ��
Note that each � is the UTF-8-encoded form of the replacement character U+FFFD (bytes EF BF BD).

Extensions using Tcl_UniCharToUtf() in Tcl 8.7 always see the following behaviour: calling Tcl_UniCharToUtf() with a high surrogate, followed by another Tcl_UniCharToUtf() call with a low surrogate, outputs a 4-byte UTF-8 sequence. The first call outputs a single byte, and the second call outputs the remaining 3 bytes. Combining the two surrogates is thus handled internally, and it cannot be switched off. In Tcl 9.0, with this TIP, surrogates are no longer combined automatically, but only when the TCL_COMBINE flag is used.
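
The two-call behaviour can be modelled with a small self-contained sketch. This is a simplified model, not Tcl's source code: the names model_unichar_to_utf, ModelState and MODEL_COMBINE are invented here, the real Tcl_UniCharToUtf() has a different signature, and the explicit state struct is an assumption made to keep the example readable. The sketch also assumes that, without the flag, a surrogate is written as an ordinary 3-byte sequence, and it leaves out the case of a high surrogate that is never followed by a low one.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_COMBINE 1   /* hypothetical stand-in for TCL_COMBINE */

/* Remembers a high surrogate between two calls; 0 means nothing pending. */
typedef struct { uint32_t pending_high; } ModelState;

/* Encode ch as 1..4 bytes, treating surrogates like any other code point. */
static size_t encode_plain(uint32_t ch, unsigned char *buf)
{
    if (ch < 0x80) {
        buf[0] = (unsigned char)ch;
        return 1;
    }
    if (ch < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (ch >> 6));
        buf[1] = (unsigned char)(0x80 | (ch & 0x3F));
        return 2;
    }
    if (ch < 0x10000) {
        buf[0] = (unsigned char)(0xE0 | (ch >> 12));
        buf[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (ch & 0x3F));
        return 3;
    }
    buf[0] = (unsigned char)(0xF0 | ((ch >> 18) & 0x07));
    buf[1] = (unsigned char)(0x80 | ((ch >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (ch & 0x3F));
    return 4;
}

/* Returns the number of bytes written to buf for this call. */
size_t model_unichar_to_utf(ModelState *st, uint32_t ch, int flags,
                            unsigned char *buf)
{
    if (flags & MODEL_COMBINE) {
        if (ch >= 0xD800 && ch <= 0xDBFF) {
            /* High surrogate: emit only the lead byte of the 4-byte form. */
            st->pending_high = ch;
            buf[0] = (unsigned char)(0xF0 | (((ch - 0xD800) + 0x40) >> 8));
            return 1;
        }
        if (ch >= 0xDC00 && ch <= 0xDFFF && st->pending_high != 0) {
            /* Low surrogate after a high one: emit the remaining 3 bytes. */
            uint32_t cp = 0x10000 + ((st->pending_high - 0xD800) << 10)
                        + (ch - 0xDC00);
            st->pending_high = 0;
            buf[0] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        }
    }
    /* No combining requested (or nothing to combine): encode as-is. */
    st->pending_high = 0;
    return encode_plain(ch, buf);
}
```

In this model, passing MODEL_COMBINE with U+D83D and then U+DE02 writes 1 byte followed by 3 bytes, together F0 9F 98 82. Passing 0 for the flags writes each surrogate separately as 3 bytes, so the two calls produce ED A0 BD and ED B8 82.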

In Tcl 8.7, the TCL_COMBINE flag will be defined too, but it simply has the value 0. This helps code written for Tcl 9.0 to compile and run unchanged in Tcl 8.7.

Implementation

Available in the tip-619 branch.

This branch targets 9.0.

Copyright

This document has been placed in the public domain.