TIP 619: New TCL_COMBINE flag for Tcl_UniCharToUtf()

Author:         Jan Nijtmans <[email protected]>
State:          Draft
Type:           Project
Vote:           Pending
Post-History:   
Keywords:       Tcl Tcl_UniCharToUtf()
Tcl-Version:    9.0
Tcl-Branch:     tip-619

Abstract

This TIP proposes a new flag, TCL_COMBINE, for the function Tcl_UniCharToUtf(). With this flag, Tcl_UniCharToUtf() tries to combine surrogates (code range U+D800 - U+DFFF) into a single code point; without it, surrogates are handled as if they were valid code points.
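
To make "combining surrogates" concrete, here is a minimal sketch of the standard UTF-16 pairing arithmetic. The helper names are hypothetical and are not part of the Tcl API; the sketch only illustrates which pairs can be combined and what code point results.

```c
#include <assert.h>
#include <stdint.h>

/* High (leading) surrogates occupy U+D800..U+DBFF. */
int is_high_surrogate(uint32_t ch)
{
    return (ch & 0xFFFFFC00u) == 0xD800u;
}

/* Low (trailing) surrogates occupy U+DC00..U+DFFF. */
int is_low_surrogate(uint32_t ch)
{
    return (ch & 0xFFFFFC00u) == 0xDC00u;
}

/* Only a high surrogate followed by a low surrogate forms a pair. */
int can_combine(uint32_t first, uint32_t second)
{
    return is_high_surrogate(first) && is_low_surrogate(second);
}

/* Map a valid pair to the code point it represents (U+10000..U+10FFFF). */
uint32_t combine_surrogates(uint32_t high, uint32_t low)
{
    assert(can_combine(high, low));
    return 0x10000u + ((high - 0xD800u) << 10) + (low - 0xDC00u);
}
```

With these helpers, the pair U+D83D U+DE02 combines to U+1F602, while U+D83D U+D83D cannot be combined because both values are high surrogates.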

Rationale

In Tcl 9.0 currently:

    $ tclsh9.0
    % string length \uD83D\uDE02
    1

But:

    $ tclsh9.0
    % string length \uD83D\uD83D
    2
Whenever a surrogate pair is encountered in a Tcl script, the two surrogates are automatically combined into a single Unicode character. This cannot be changed in Tcl 8.7 for three reasons:

Those 3 restrictions are gone in Tcl 9.0.

This TIP proposes that the function Tcl_UniCharToUtf() no longer combine surrogates by default, unlike in Tcl 8.x. Using the TCL_COMBINE flag makes Tcl_UniCharToUtf() behave as in Tcl 8.x again. Since surrogates are illegal in UTF-8, the effect of this TIP will be:

    $ tclsh9.0
    % string length \uD83D\uDE02
    2
    % puts \uD83D\uDE02
    error writing "stdout": illegal byte sequence
    % encoding convertto utf-8 \uD83D\uDE02
    unexpected character at index 0: 'U+00D83D'

If you really want to output invalid UTF-8, you can use the cesu-8 encoding:

    % encoding convertto cesu-8 \uD83D\uDE02
    😊
    % fconfigure stdout -encoding cesu-8
    % puts \uD83D\uDE02
    😂
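
The difference between the two encodings can be sketched at the byte level. The following is an illustration only, with hypothetical function names that are not part of Tcl: CESU-8 encodes each surrogate of a pair as its own 3-byte sequence, whereas UTF-8 encodes the combined code point as one 4-byte sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write a value in U+0800..U+FFFF as three bytes. Surrogates are not
 * rejected here; encoding them this way is what yields CESU-8 rather
 * than valid UTF-8. */
size_t sketch_encode3(uint32_t ch, unsigned char *out)
{
    assert(ch >= 0x800u && ch <= 0xFFFFu);
    out[0] = (unsigned char)(0xE0u | (ch >> 12));
    out[1] = (unsigned char)(0x80u | ((ch >> 6) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | (ch & 0x3Fu));
    return 3;
}

/* CESU-8 style: each surrogate of the pair is encoded on its own (6 bytes). */
size_t sketch_cesu8_pair(uint32_t high, uint32_t low, unsigned char *out)
{
    size_t n = sketch_encode3(high, out);
    return n + sketch_encode3(low, out + n);
}

/* UTF-8: a code point in U+10000..U+10FFFF takes four bytes. */
size_t sketch_utf8_supplementary(uint32_t cp, unsigned char *out)
{
    assert(cp >= 0x10000u && cp <= 0x10FFFFu);
    out[0] = (unsigned char)(0xF0u | (cp >> 18));
    out[1] = (unsigned char)(0x80u | ((cp >> 12) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
    out[3] = (unsigned char)(0x80u | (cp & 0x3Fu));
    return 4;
}
```

For the pair U+D83D U+DE02, the CESU-8 style bytes are ED A0 BD ED B8 82, while the UTF-8 bytes of the combined code point U+1F602 are F0 9F 98 82.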

The -nocomplain option won't give the desired effect in this case:

    % encoding convertto -nocomplain utf-8 \uD83D\uDE02
    ��

Note that � is the UTF-8-encoded form of the replacement character (U+FFFD).

In Tcl 8.7, extensions using Tcl_UniCharToUtf() always see the following behaviour: calling Tcl_UniCharToUtf() with a high surrogate (U+D800 - U+DBFF), followed by another Tcl_UniCharToUtf() call with a low surrogate (U+DC00 - U+DFFF), outputs a 4-byte UTF-8 sequence. The first call outputs a single byte, the second call outputs the remaining 3 bytes. So combining the two surrogates is handled internally, and it cannot be switched off. In Tcl 9.0, with this TIP, surrogates are no longer combined automatically, but only when the TCL_COMBINE flag is used.
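
The two-call behaviour can be modelled with a small self-contained sketch. This is not Tcl's implementation: the explicit SketchState struct, the function name, and the combine parameter (standing in for the TCL_COMBINE flag) are all assumptions made for illustration. It shows a surrogate in U+D800 - U+DBFF producing one byte, and a following surrogate in U+DC00 - U+DFFF producing the remaining three.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state: remembers a high surrogate seen by the previous call. */
typedef struct {
    uint32_t pending_high; /* 0 when nothing is pending */
} SketchState;

/* Model only, not Tcl source. Encodes ch at buf and returns the number of
 * bytes written. A high surrogate that is never followed by a low one
 * leaves a dangling lead byte; the model does not try to repair that. */
int sketch_unichar_to_utf(SketchState *st, uint32_t ch, int combine,
                          unsigned char *buf)
{
    assert(st != NULL && buf != NULL);
    if (combine && (ch & 0xFFFFFC00u) == 0xD800u) {
        /* High surrogate: emit only the lead byte of the 4-byte sequence. */
        uint32_t top = (ch - 0xD800u) + 0x40u; /* equals code point >> 10 */
        st->pending_high = ch;
        buf[0] = (unsigned char)(0xF0u | (top >> 8));
        return 1;
    }
    if (combine && (ch & 0xFFFFFC00u) == 0xDC00u && st->pending_high != 0) {
        /* Low surrogate after a pending high one: emit the other 3 bytes. */
        uint32_t cp = 0x10000u + ((st->pending_high - 0xD800u) << 10)
                    + (ch - 0xDC00u);
        st->pending_high = 0;
        buf[0] = (unsigned char)(0x80u | ((cp >> 12) & 0x3Fu));
        buf[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        buf[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 3;
    }
    st->pending_high = 0;
    if (ch < 0x80u) {
        buf[0] = (unsigned char)ch;
        return 1;
    }
    if (ch < 0x800u) {
        buf[0] = (unsigned char)(0xC0u | (ch >> 6));
        buf[1] = (unsigned char)(0x80u | (ch & 0x3Fu));
        return 2;
    }
    if (ch < 0x10000u) {
        /* Without combining, surrogates are encoded like any other value. */
        buf[0] = (unsigned char)(0xE0u | (ch >> 12));
        buf[1] = (unsigned char)(0x80u | ((ch >> 6) & 0x3Fu));
        buf[2] = (unsigned char)(0x80u | (ch & 0x3Fu));
        return 3;
    }
    buf[0] = (unsigned char)(0xF0u | (ch >> 18));
    buf[1] = (unsigned char)(0x80u | ((ch >> 12) & 0x3Fu));
    buf[2] = (unsigned char)(0x80u | ((ch >> 6) & 0x3Fu));
    buf[3] = (unsigned char)(0x80u | (ch & 0x3Fu));
    return 4;
}
```

Calling the model twice with combine set, first for U+D83D and then for U+DE02 at the advanced buffer position, writes 1 + 3 bytes forming F0 9F 98 82. With combine cleared, each surrogate is written as its own 3-byte sequence.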

In Tcl 8.7, the TCL_COMBINE flag will be defined too, but simply with the value 0. This lets code written for Tcl 9.0 compile and run unchanged with Tcl 8.7.

Implementation

Available in the tip-619 branch.

This branch targets 9.0.

Copyright

This document has been placed in the public domain.