Author: Jan Nijtmans <[email protected]>
State: Final
Type: Project
Vote: Done
Created: 23-Jan-2018
Post-History:
Discussions-To: Tcl Core list
Keywords: Tcl
Tcl-Version: 9.0
Tcl-Branch: tip-497
Vote-Summary: Accepted 6/0/1
Votes-For: DKF, FV, JN, KBK, KW, SL
Votes-Against: none
Votes-Present: DGP
Abstract
This TIP proposes to add full support for Unicode planes 1-16, which contain characters >= U+010000, inclusive in the regexp engine. Also, the caveats remaining from TIP #389 will be handled here.
Rationale
In Tcl 8.7, running [length "😊"]
gives the result 2. The reason for
this is that - internally - Tcl 8.x splits the character in two
surrogate characters, so they fit in a 16-bit character array.
This TIP (for Tcl 9.0) is meant to put an end to this workaround,
providing a real solution.
Specification
This document proposes:
Change the internal "string" objType, such that it will store 32-bit characters in stead of 16-bit characters internally.
No longer consider Emoji to consist of 2 surrogate characters. So
[length "😊"] == 1
Modify the Tcl_UniChar type, such that - by default -
sizeof(Tcl_UniChar)
== 4 (was: 2). All extensions assumingsizeof(Tcl_UniChar)
== 2 will be affected by this.
It would be possible to introduce an additional 16-bit character objType internally, this TIP doesn't implement or propose that. That could always be added later, but at this moment it is questionable whether that would give a serious advantage or not. Anyway, out of scope for this TIP.
Compatibility
For Tcl 8.x, sizeof(Tcl_UniChar)
== 2. All extensions making this
assumption will be affected, since for Tcl 9.0 that will not be true
any more. There are two ways extensions can be modified:
Either fix the places in the code where Tcl_UniChar is used in a non-portable way. The API introduced by TIP #548 can be used for that.
An alternative is compile the extension using
-DTCL_UTF_MAX=3
. This will restoresizeof(Tcl_UniChar) == 2
as visible for the extension. Almost all (non-deprecated) Tcl API's can still be used by the extension, even the ones usingTcl_UniChar
, except:Tcl_NewUnicodeObj()
,Tcl_SetUnicodeObj()
,Tcl_GetUnicodeFromObj()
,Tcl_GetUnicode()
. There is no need to re-compile Tcl using-DTCL_UTF_MAX=3
for the extension to work.
This change introduces a potential binary incompatibility compared to Tcl 9.0a3 and earlier: All extensions using any of the 4 following functions need to be recompiled:
Tcl_NewUnicodeObj()
Tcl_SetUnicodeObj()
Tcl_GetUnicodeFromObj()
Tcl_GetUnicode()
None of the (battery-included) extensions included by Tcl (e.g. Thread, tdbc ...) are affected. Extensions need to be aware that those 4 functions interface directly with the internal "string" objType, which stores 16-bit characters in Tcl 8.x and 32-bit characters in Tcl 9.x.
Reference Implementation
A reference implementation is available in the tip-497 branch. https://core.tcl-lang.org/tcl/timeline?r=tip-497
Copyright
This document has been placed in the public domain.