Author: Jan Nijtmans <[email protected]>
State: Draft
Type: Project
Vote: Pending
Created: 10-May-2019
Post-History:
Discussions-To: Tcl Core list
Keywords: Tcl
Tcl-Version: 8.7
Tcl-Branch: utf-max

This TIP proposes being able to switch Tcl between Full Unicode mode
(TCL_UTF_MAX>3, almost compatible with Androwish) and current partial
Unicode mode (as far as TIP #389 goes, using
Full Unicode mode means that [string length \U10000] will no longer be
2 but the expected 1, so Unicode characters larger than \uFFFF
will no longer be split internally into two surrogate characters.
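The difference between the two modes can be illustrated with a small, self-contained C sketch. It does not use the Tcl API; the helper names utf16_units and to_surrogates are hypothetical and exist only for this illustration. It shows why a character above \uFFFF occupies two units when the internal representation is 16 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of 16-bit units needed for one code point. */
static int utf16_units(uint32_t cp)
{
    return (cp > 0xFFFF) ? 2 : 1;
}

/* Hypothetical helper: split a code point above U+FFFF into its
 * high and low surrogate, as a 16-bit representation has to do. */
static void to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v;

    assert(cp > 0xFFFF && cp <= 0x10FFFF);
    v = cp - 0x10000;                        /* 20 significant bits remain */
    *high = (uint16_t)(0xD800 + (v >> 10));  /* upper 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF)); /* lower 10 bits */
}
```

For U+10000 the helpers yield the pair 0xD800 0xDC00, which is where the length of 2 in partial Unicode mode comes from. With a 32-bit unit the same character is stored as a single value.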
Tcl can currently be compiled in three different modes: TCL_UTF_MAX=3, TCL_UTF_MAX=4 or TCL_UTF_MAX=6. The first two are now equivalent in Tcl 8.7 (since TIP #389). TCL_UTF_MAX=6 is overkill, since no UTF-8 character consists of more than 4 bytes.
Therefore it makes sense to reduce this to only two modes: TCL_UTF_MAX=3 means being fully compatible with Tcl 8.6, while TCL_UTF_MAX=4 means compatibility with the Androwish version of Tcl. Defining TCL_UTF_MAX=6 still results in a valid compilation (functioning the same as TCL_UTF_MAX=4); the only difference is that some buffer sizes will be 2 bytes larger than necessary.
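The 4-byte upper bound can be checked with a short sketch of standard UTF-8 encoding. This is plain C, not Tcl's own encoder (Tcl's internal encoding differs in some details), and the names utf8_len and utf8_encode are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed by standard UTF-8 for one code point. */
static int utf8_len(uint32_t cp)
{
    assert(cp <= 0x10FFFF);   /* highest Unicode code point */
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;                 /* never more than 4 bytes */
}

/* Hypothetical helper: encode cp into buf, which must hold at least
 * 4 bytes. Returns the number of bytes written. */
static int utf8_encode(uint32_t cp, unsigned char *buf)
{
    int n = utf8_len(cp);

    switch (n) {
    case 1:
        buf[0] = (unsigned char)cp;
        break;
    case 2:
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    case 3:
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    default:
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    }
    return n;
}
```

Even U+10FFFF, the largest code point, fits in 4 bytes (F4 8F BF BF), so a per-character buffer sized for 6 bytes carries 2 bytes that are never used.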
Androwish chose to use a Tcl mode that was unsupported at the time: changing the size of the Tcl_UniChar type using TCL_UTF_MAX=6. This causes a binary incompatibility, with the result that all extensions need to be recompiled with TCL_UTF_MAX=6 as well. This TIP proposes to add a supported TCL_UTF_MAX=4 compilation mode to Tcl, which has the same effect as the earlier unsupported TCL_UTF_MAX=6, but without the need to recompile all extensions.
There are 5 functions which expose the internal structure of Tcl strings to extensions,
functions that are rarely used in extensions. Those are
They all have a UTF-8 equivalent which can be used instead. Switching TCL_UTF_MAX
to a different value causes a binary incompatibility for these functions, but it is
not worth doing anything about it. That is why this TIP proposes to deprecate them,
and to remove them completely in Tcl 9.0.
The default compilation mode for Tcl 8.x will continue to be TCL_UTF_MAX=3, which is 100% upwards compatible with Tcl 8.6.
This TIP opens the way to eventually switching Tcl to TCL_UTF_MAX=4 for Tcl 9.0. More work needs to be done for that, but that is the subject of TIP #497. This TIP describes how far we can go for Tcl 8.x without causing unacceptable binary incompatibilities.
This document proposes:
Allow Tcl to be compiled with either -DTCL_UTF_MAX=3 (default), or with -DTCL_UTF_MAX=4. In the latter mode, the Tcl_UniChar type becomes a 32-bit type.
Allow Tcl extensions to be compiled with either -DTCL_UTF_MAX=3 (default), or with -DTCL_UTF_MAX=4, independently of how Tcl is compiled.
Deprecate the following functions:
If Tcl is compiled with either -DTCL_UTF_MAX=4 or -DTCL_NO_DEPRECATED, those functions will no longer be available to extensions. In Tcl 9.0 those 5 functions will be gone completely.
Deprecate the following functions when Tcl is compiled in a different -DTCL_UTF_MAX mode than the extension.
If Tcl is compiled with a different TCL_UTF_MAX value than the extension, those 4 functions cannot be used. In order to keep binary compatibility with 8.6, those 4 functions are not supported for extensions compiled with -DTCL_UTF_MAX=4. You can use the functions in TIP #548 instead, for converting between Tcl_UniChar and UTF-8.
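The compile-time switch described in this proposal can be sketched as follows. This is only an illustration of the mechanism, not the contents of tcl.h: SKETCH_UTF_MAX stands in for TCL_UTF_MAX, and SketchUniChar and sketch_units are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef SKETCH_UTF_MAX
#   define SKETCH_UTF_MAX 3       /* stand-in for the default TCL_UTF_MAX */
#endif

#if SKETCH_UTF_MAX > 3
typedef uint32_t SketchUniChar;   /* full Unicode: one unit per code point */
#else
typedef uint16_t SketchUniChar;   /* 16-bit units: surrogates above U+FFFF */
#endif

/* Hypothetical helper: units needed to store n code points, of which
 * above_bmp lie above U+FFFF. */
static size_t sketch_units(size_t n, size_t above_bmp)
{
    assert(above_bmp <= n);
    return (sizeof(SketchUniChar) == 4) ? n : n + above_bmp;
}
```

Compiling this once with the default and once with -DSKETCH_UTF_MAX=4 gives two different element sizes for the same array type. That is the reason functions which pass such arrays across the boundary between Tcl and an extension cannot be used when the two sides were compiled with different values.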
As long as Tcl is compiled with -DTCL_UTF_MAX=3, this is fully upwards compatible.
When Tcl is compiled with -DTCL_UTF_MAX=4, this is, at the Tcl level, compatible with the Androwish version of Tcl. At the C API level, it is upwards compatible with Tcl 8.6 in TCL_UTF_MAX=6 mode, except for the functions mentioned above.
Extensions compiled with -DTCL_UTF_MAX=4 cannot use any of the deprecated functions mentioned in this TIP. Using any of them results in a link error.
If Tcl is compiled with -DTCL_UTF_MAX=4, the deprecated functions will be removed from the stub table. Any extension using them, even one compiled with -DTCL_UTF_MAX=3, will no longer work: the result is a panic clearly indicating the problem.
Remark: All extensions bundled with Tcl (and Tk 8.7a3 as well) have already been modified so that they no longer use any of those deprecated functions. They can therefore be compiled with any TCL_UTF_MAX value, independent of the TCL_UTF_MAX value Tcl was compiled with.
A reference implementation is available in the utf-max branch. https://core.tcl-lang.org/tcl/timeline?r=utf-max
This document has been placed in the public domain.