# TIP 542:Support for switchable Full Unicode support
Author: Jan Nijtmans <[email protected]>
Author: Jan Nijtmans <[email protected]>
State: Final
Type: Project
Vote: Done
Created: 10-May-2019
Post-History:
Discussions-To: Tcl Core list
Keywords: Tcl
Tcl-Version: 8.7
Tcl-Branch: utf-max
Vote-Summary: Accepted 6/0/0
Votes-For: FV, JN, KBK, KW, MC, SL
Votes-Against: none
Votes-Present: none
-----
# Abstract
This TIP proposes being able to switch Tcl between Full Unicode mode
(TCL\_UTF\_MAX>3, almost compatible with Androwish) and current partial
Unicode mode (as far as TIP #389 goes, using `TCL_UTF_MAX=3`)
Full Unicode mode means that \[string length \\U10000\] will no longer
return `2` but the expected `1`, so Unicode characters larger than \\uFFFF
will no longer internally be split into two surrogate characters.
# Rationale
Tcl currently can be compiled in 3 different modes: using TCL\_UTF\_MAX=3, TCL\_UTF\_MAX=4
or TCL\_UTF\_MAX=6. The first 2 are actually equal now in Tcl 8.7 (since TIP #389). Using
TCL\_UTF\_MAX=6 is actually overkill, since no utf-8 character consists of more than 4 bytes.
Therefore it makes sense to reduce this to only two modes: TCL\_UTF\_MAX=3 means
being fully compatible with Tcl 8.6, while TCL\_UTF\_MAX=4 means compatibility with
the Androwish-version of Tcl. Defining TCL\_UTF\_MAX=6 results in a valid
compilation as well (functioning the same as TCL\_UTF\_MAX=4), only some buffer-sizes
will be 2 bytes larger than necessary.
Androwish made the choice to use an (at that time) un-supported Tcl mode: Changing the size
of the Tcl\_UniChar type using TCL\_UTF\_MAX=6. This causes a binary incompatibility
which results that all extensions need to be re-compiled with TCL\_UTF\_MAX=6 as well.
This TIP proposes to add a supported TCL\_UTF\_MAX=4 compilation mode to Tcl, which has
the same effect as the earlier unsupported TCL\_UTF\_MAX=6, but without the need to
re-compile all extensions.
There are 4 functions which expose the internal structure of Tcl strings to extensions,
functions that are rarely used in extensions. Those are
`Tcl_UniCharCaseMatch()`, `Tcl_UniCharLen()`, `Tcl_UniCharNcmp()` and `Tcl_UniCharNcasecmp()`.
They all have a UTF-8 equivalent which can be used instead. Switching TCL\_UTF\_MAX
to a different value causes a binary incompatibility, but it's not worth to do
anything about it. That's why this TIP proposes to deprecate them, and have them
removed completely in Tcl 9.0 (actually: they will become MODULE\_SCOPE internal
functions in 9.0).
The default compilation mode for Tcl 8.x will continue to be TCL\_UTF\_MAX=3, which is 100%
upwards compatible with Tcl 8.6.
This TIP opens the room to be able to switch Tcl using TCL\_UTF\_MAX=4 eventually for
Tcl 9.0. More work needs to be done for that, but that's stuff for TIP #497. This TIP
describes how far we can go for Tcl 8.x without causing unacceptable binary incompatibilities.
# Specification
This document proposes:
* Allow Tcl to be compiled with either -DTCL\_UTF\_MAX=3 (default), or with -DTCL\_UTF\_MAX=4.
In the latter mode, the Tcl_UniChar type becomes a 32-bit type.
* Allow Tcl extensions to be compiled with either -DTCL\_UTF\_MAX=3 (default), or with -DTCL\_UTF\_MAX=4,
independent from how Tcl is compiled.
* Deprecate the following functions:
Tcl\_UniCharCaseMatch()
Tcl\_UniCharLen()
Tcl\_UniCharNcmp()
Tcl\_UniCharNcasecmp()
If Tcl is compiled with either -DTCL\_UTF\_MAX=4 or -DTCL\_NO\_DEPRECATED, those functions will no longer be available for extensions.
In Tcl 9.0 those 5 functions will be completely gone.
* Deprecate the following functions when Tcl is compiled in a different -DTCL\_UTF\_MAX mode than the extension.
Tcl\_GetUnicode()
Tcl\_GetUnicodeFromObj()
Tcl\_NewUnicodeObj()
Tcl\_SetUnicodeObj()
Tcl\_AppendUnicodeToObj()
If Tcl is compiled with a different TCL\_UTF\_MAX value than the extension, those 5 functions cannot be used.
In order to keep binary compatibility with 8.6, those 5 functions are not supported for extensions compiled
with -DTCL\_UTF\_MAX=4. You can use the functions in TIP #548 instead, for converting between Tcl_UniChar
and UTF-8.
# Compatibility
As long as Tcl is compiled with -DTCL\_UTF\_MAX=3, this is fully upwards compatible.
When Tcl is compiled with -DTCL\_UTF\_MAX=4, this is at the Tcl level, compatible with the Androwish-version
of Tcl. At the C-API level, it's upwards compatible with Tcl 8.6 in TCL\_UTF\_MAX=6 mode, except for the
functions mentioned above.
# Later change
The original version of this TIP (as voted upon), proposed to fully deprecate the
Tcl\_AppendUnicodeToObj(). However the later [TIP #622](622.md) changed that: Only
deprecate Tcl\_AppendUnicodeToObj() when Tcl is compiled in a different -DTCL\_UTF\_MAX
mode than the extension.
# Caveats
* Extensions compiled with -DTCL\_UTF\_MAX=4 cannot use any of the deprecated functions mentioned in this TIP.
Using any of them results in a link error.
* If Tcl is compiled with -DTCL\_UTF\_MAX=4, the deprecated functions will be removed from the stub-table. Any
extension using those, even if the extension is compiled with -DTCL\_UTF\_MAX=3, won't work any more: This
results in a panic clearly indicating the problem.
Remark: All extensions bundled with Tcl (and Tk 8.7a3 as well) are already modified not to use any of those
deprecated functions any more. So they can be compiled with any TCL\_UTF\_MAX value, independent on the
TCL\_UTF\_MAX value Tcl was compiled with.
# Reference Implementation
A reference implementation is available in the **utf-max** branch.
<https://core.tcl-lang.org/tcl/timeline?r=utf-max>
# Copyright
This document has been placed in the public domain.