Author: Jan Nijtmans <[email protected]> State: Final Type: Project Vote: Done Created: 23-Jan-2018 Post-History: Discussions-To: Tcl Core list Keywords: Tcl Tcl-Version: 9.0 Tcl-Branch: tip-497 Vote-Summary: Accepted 6/0/1 Votes-For: DKF, FV, JN, KBK, KW, SL Votes-Against: none Votes-Present: DGP
This TIP proposes to add full support for Unicode planes 1-16, which contain characters >= U+010000, inclusive in the regexp engine. Also, the caveats remaining from TIP #389 will be handled here.
In Tcl 8.7, running
[length "😊"] gives the result 2. The reason for
this is that - internally - Tcl 8.x splits the character in two
surrogate characters, so they fit in a 16-bit character array.
This TIP (for Tcl 9.0) is meant to put an end to this workaround,
providing a real solution.
This document proposes:
Change the internal "string" objType, such that it will store 32-bit characters in stead of 16-bit characters internally.
No longer consider Emoji to consist of 2 surrogate characters. So
[length "😊"] == 1
Modify the Tcl_UniChar type, such that - by default -
sizeof(Tcl_UniChar)== 4 (was: 2). All extensions assuming
sizeof(Tcl_UniChar)== 2 will be affected by this.
It would be possible to introduce an additional 16-bit character objType internally, this TIP doesn't implement or propose that. That could always be added later, but at this moment it is questionable whether that would give a serious advantage or not. Anyway, out of scope for this TIP.
For Tcl 8.x,
sizeof(Tcl_UniChar) == 2. All extensions making this
assumption will be affected, since for Tcl 9.0 that will not be true
any more. There are two ways extensions can be modified:
Either fix the places in the code where Tcl_UniChar is used in a non-portable way. The API introduced by TIP #548 can be used for that.
An alternative is compile the extension using
-DTCL_UTF_MAX=3. This will restore
sizeof(Tcl_UniChar) == 2as visible for the extension. Almost all (non-deprecated) Tcl API's can still be used by the extension, even the ones using
Tcl_GetUnicode(). There is no need to re-compile Tcl using
-DTCL_UTF_MAX=3for the extension to work.
This change introduces a potential binary incompatibility compared to Tcl 9.0a3 and earlier: All extensions using any of the 4 following functions need to be recompiled:
None of the (battery-included) extensions included by Tcl (e.g. Thread, tdbc ...) are affected. Extensions need to be aware that those 4 functions interface directly with the internal "string" objType, which stores 16-bit characters in Tcl 8.x and 32-bit characters in Tcl 9.x.
A reference implementation is available in the tip-497 branch. https://core.tcl-lang.org/tcl/timeline?r=tip-497
This document has been placed in the public domain.