TIP 497: Full support for Unicode planes 1-16.

Author:         Jan Nijtmans <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Created:        23-Jan-2018
Discussions-To: Tcl Core list
Keywords:       Tcl
Tcl-Version:    9.0
Tcl-Branch:     tip-497
Vote-Summary:   Accepted 6/0/1
Votes-For:      DKF, FV, JN, KBK, KW, SL
Votes-Against:  none
Votes-Present:  DGP


This TIP proposes adding full support for Unicode planes 1-16, which contain the characters >= U+010000, including in the regexp engine. It also handles the caveats remaining from TIP #389.


In Tcl 8.7, running [string length "😊"] gives the result 2. The reason is that, internally, Tcl 8.x splits the character into two surrogates so that it fits in a 16-bit character array. This TIP (for Tcl 9.0) is meant to put an end to this workaround by providing a real solution.
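The split described above can be sketched in C. This is only a minimal illustration of standard UTF-16 encoding, not Tcl's internal code; the helper name `sketch_utf16_encode` is hypothetical and is not part of the Tcl API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (NOT a Tcl function): encode one Unicode code point
 * as UTF-16 and return the number of 16-bit units written (1 or 2). */
size_t sketch_utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Accept only valid scalar values: <= U+10FFFF and not a surrogate. */
    assert(cp <= 0x10FFFFu && !(cp >= 0xD800u && cp <= 0xDFFFu));

    if (cp < 0x10000u) {
        /* Plane 0: the code point fits in a single 16-bit unit. */
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Planes 1-16: subtract 0x10000 and spread the remaining 20 bits
     * over a high and a low surrogate. */
    cp -= 0x10000u;
    out[0] = (uint16_t)(0xD800u + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00u + (cp & 0x3FFu)); /* low surrogate */
    return 2;
}
```

For U+1F60A (😊) this produces the two units 0xD83D and 0xDE0A, which is why a length counted in 16-bit units comes out as 2.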


This document proposes:

It would be possible to introduce an additional 16-bit character objType internally, but this TIP neither implements nor proposes that. It could always be added later, although at this moment it is questionable whether it would give a serious advantage. In any case, it is out of scope for this TIP.


For Tcl 8.x, sizeof(Tcl_UniChar) == 2. All extensions making this assumption will be affected, since for Tcl 9.0 that will not be true any more. There are two ways extensions can be modified:

This change introduces a potential binary incompatibility compared to Tcl 9.0a3 and earlier: All extensions using any of the 4 following functions need to be recompiled:

None of the extensions bundled with Tcl (e.g. Thread, tdbc ...) are affected. Extension authors need to be aware that those 4 functions interface directly with the internal "string" objType, which stores 16-bit characters in Tcl 8.x and 32-bit characters in Tcl 9.x.
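To illustrate the difference between the two storage widths, the following C sketch widens a buffer of 16-bit units into 32-bit code points, combining surrogate pairs along the way. It is an illustration under stated assumptions, not code taken from Tcl: the function name `sketch_utf16_to_utf32` is hypothetical, and it uses plain `uint16_t` and `uint32_t` rather than `Tcl_UniChar`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (NOT a Tcl function): convert n UTF-16 units in src
 * into 32-bit code points in dst and return how many code points were
 * written. dst must have room for at least n elements. Unpaired
 * surrogates are copied through unchanged. */
size_t sketch_utf16_to_utf32(const uint16_t *src, size_t n, uint32_t *dst)
{
    size_t i = 0, count = 0;

    assert(src != NULL && dst != NULL);

    while (i < n) {
        uint32_t c = src[i++];

        /* A high surrogate followed by a low surrogate forms one
         * code point in planes 1-16. */
        if (c >= 0xD800u && c <= 0xDBFFu && i < n
                && src[i] >= 0xDC00u && src[i] <= 0xDFFFu) {
            c = 0x10000u + ((c - 0xD800u) << 10) + (src[i++] - 0xDC00u);
        }
        dst[count++] = c;
    }
    return count;
}
```

The three 16-bit units 0x0041, 0xD83D, 0xDE0A become the two code points U+0041 and U+1F60A. Merely casting a pointer from one element width to the other would not perform this conversion, which is why code that hard-codes a 2-byte element size needs attention.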

Reference Implementation

A reference implementation is available in the tip-497 branch. https://core.tcl-lang.org/tcl/timeline?r=tip-497


This document has been placed in the public domain.