Author: Jan Nijtmans <[email protected]>
State: Draft
Type: Project
Vote: In progress
Created: 23-Mar-2022
Tcl-Version: 8.7
Keywords: Tcl
Tcl-Branch: full-utf-for-87

Although Tcl 8.7 understands quite a lot of Unicode, there is one thing it still cannot do: count a character outside the Basic Multilingual Plane as a single character.

    $ tclsh8.7
    % string length 🤝
    2

The reason is that, internally, sizeof(Tcl_UniChar) == 2, and this is
visible throughout the Tcl C API. "string length" counts UTF-16 code
units, not Unicode characters. Changing that (by defining
TCL_UTF_MAX = 4) would change the behaviour of the C API. For example:

    Tcl_Obj *obj = Tcl_NewStringObj("🤝x", -1);
    int size = Tcl_GetCharLength(obj);
    Tcl_UniChar *uniCharString = Tcl_GetUnicode(obj);
    Tcl_Obj *partOfTheString = Tcl_GetRange(obj, 2, 2);

In Tcl 8.6, the above example gives size = 3, uniCharString is a 16-bit array, and partOfTheString contains the value "x". We cannot change that in Tcl 8.7: extensions that depend on this behavior and were compiled against Tcl 8.6 headers would behave differently when loaded in Tcl 8.7. The C API must stay binary compatible.
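
The difference between the two ways of counting can be illustrated without the Tcl headers. The following is a minimal sketch, assuming well-formed, NUL-terminated UTF-8 input; count_code_points and count_utf16_units are hypothetical helper names invented for this illustration and are not part of the Tcl API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Tcl function): number of Unicode code points
 * in a well-formed, NUL-terminated UTF-8 string. */
static size_t count_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Every byte that is not a continuation byte (10xxxxxx)
         * starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* Hypothetical helper (not a Tcl function): number of UTF-16 code units
 * needed to represent the same string. */
static size_t count_utf16_units(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;

        if ((c & 0xC0) == 0x80) {
            continue;           /* continuation byte, already counted */
        }
        /* A 4-byte lead byte (11110xxx) encodes a code point above
         * U+FFFF, which needs a surrogate pair: two UTF-16 units. */
        n += ((c & 0xF8) == 0xF0) ? 2 : 1;
    }
    return n;
}
```

For the string "🤝x" (UTF-8 bytes F0 9F A4 9D followed by 'x'), count_code_points yields 2 and count_utf16_units yields 3; the latter matches the size = 3 seen by an extension compiled against Tcl 8.6 headers.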
This TIP proposes to set TCL_UTF_MAX to 4 internally and to create a UTF-16 compatibility layer of stub entries, so that extensions are not affected. This compatibility layer will be implemented for Tcl 8.7 only; it will not be forward-merged to Tcl 9.0.
In tcl.h, determine TCL_UTF_MAX as follows:

    #ifndef TCL_UTF_MAX
    #   ifdef BUILD_tcl
    #       define TCL_UTF_MAX 4
    #   else
    #       define TCL_UTF_MAX 3
    #   endif
    #endif

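
As a rough illustration of what this selection means for an extension, the sketch below repeats the same preprocessor logic in a standalone file and reports the size of a scratch buffer declared as char buf[TCL_UTF_MAX]. It assumes the file is compiled without -DBUILD_tcl and without a predefined TCL_UTF_MAX, as an extension would be; scratch_buffer_size is a hypothetical name used only here.

```c
#include <assert.h>

/* Same selection logic as in the tcl.h fragment above, copied here so the
 * example compiles without the Tcl headers. */
#ifndef TCL_UTF_MAX
#   ifdef BUILD_tcl
#       define TCL_UTF_MAX 4
#   else
#       define TCL_UTF_MAX 3
#   endif
#endif

/* Hypothetical helper: size in bytes of a buffer sized with TCL_UTF_MAX,
 * as seen by the code being compiled. */
static int scratch_buffer_size(void)
{
    char buf[TCL_UTF_MAX];

    assert(TCL_UTF_MAX == 3 || TCL_UTF_MAX == 4);
    return (int)sizeof(buf);
}
```

Compiled as extension code (BUILD_tcl undefined), the buffer has 3 bytes; the same declaration inside the Tcl core would have 4.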
This means that Tcl is built using UTF-32 internally. A new set of stub entries is created for 5 functions:

    int TclNumUtfChars(const char *, int)
    int TclGetCharLength(Tcl_Obj *)
    const char *TclUtfAtIndex(const char *, int)
    Tcl_Obj *TclGetRange(Tcl_Obj *, int, int)
    int TclGetUniChar(Tcl_Obj *, int)

Those 5 functions are used everywhere in Tcl, and they count in UTF-32. So TclNumUtfChars("🤝", -1) will return 2 as Tcl 8.6 does. But extensions using Tcl_UtfAtIndex will continue to use the original functions, which count UTF-16 characters.
Extensions which want to use the new UTF-32 functions, can define
Tcl_GetUniChar will be mapped to
Also, the following 3 functions which were deprecated in Tcl 8.7 (because they don't work well with UTF-32) are added to this compatibility layer:

    int Tcl_UniCharNcmp(const Tcl_UniChar *, const Tcl_UniChar *, unsigned long);
    int Tcl_UniCharNcasecmp(const Tcl_UniChar *, const Tcl_UniChar *, unsigned long);
    int Tcl_UniCharCaseMatch(const Tcl_UniChar *, const Tcl_UniChar *, int);

Those 3 functions are still deprecated (see TIP #542), but they are implemented in the UTF-16 compatibility layer for Tcl 8.7. In Tcl 9.0, they are gone.
The "string" objType is renamed to "utf32string", and a new "string" objType is implemented which uses UTF-16 code units. This objType is used in the compatibility layer of the following 5 functions:

    Tcl_UniChar *Tcl_GetUnicode(Tcl_Obj *);
    Tcl_UniChar *Tcl_GetUnicodeFromObj(Tcl_Obj *, int *);
    Tcl_Obj *Tcl_NewUnicodeObj(const Tcl_UniChar *, int);
    void Tcl_SetUnicodeObj(Tcl_Obj *, const Tcl_UniChar *, int);
    void Tcl_AppendUnicodeToObj(Tcl_Obj *, const Tcl_UniChar *, int);

If Tcl is compiled with -DTCL_NO_DEPRECATED, then the UTF-16 compatibility
layer is removed. This is meant to verify that the compatibility layer is not
used internally anywhere. It also means that extensions using any of the
above APIs will panic. Compiling the extension with
make the extension work again, but this is only meant for test
purposes, not for production!
Finally, this TIP proposes to undo the deprecation of
Although this was proposed in TIP #542, it turned out that this function
could not really be removed; it just moved to become an internal stub function in
Tcl 9.0. Therefore, there is no burden in exposing it again.
Since TCL_UTF_MAX is raised internally from 3 to 4, the behavior of
the encoding/decoding functions changes. For example, consider the following code:

    Tcl_EncodingState state;
    int read;
    Tcl_Encoding encoding = Tcl_GetEncoding(interp, "utf-8");
    char buf[TCL_UTF_MAX] = "";
    int result = Tcl_ExternalToUtf(interp, encoding, "🤝", 4, flags, &state,
            buf, sizeof(buf), &read, NULL, NULL);

In Tcl 8.6, after this call, buf is filled with the bytes 0xED 0xA0 0xBE, which is the CESU-8 representation of a high surrogate. In Tcl 8.6, Tcl_ExternalToUtf is guaranteed to provide some output if the buffer provided has at least 3 bytes. In Tcl 8.7, buffers used for Tcl_UtfToExternal need at least 4 bytes, otherwise 4-byte UTF-8 sequences cannot be handled. Therefore, in Tcl 8.7, the above example won't give any output (read = 0), since the buffer cannot hold even a single Unicode character.
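
The byte values quoted above can be checked by hand. The sketch below is not Tcl code and does not call the encoding API; it only reproduces the arithmetic for U+1F91D (🤝) using three hypothetical helpers (high_surrogate, encode_3byte, utf8_length) introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: UTF-16 high surrogate for a code point above U+FFFF. */
static unsigned high_surrogate(unsigned long cp)
{
    assert(cp > 0xFFFFUL && cp <= 0x10FFFFUL);
    return 0xD800u + (unsigned)((cp - 0x10000UL) >> 10);
}

/* Hypothetical helper: applies the generic 3-byte UTF-8 bit pattern to a
 * 16-bit value. Applied to a surrogate, this gives one half of a CESU-8
 * encoded pair. */
static void encode_3byte(unsigned v, unsigned char out[3])
{
    assert(v >= 0x800u && v <= 0xFFFFu);
    out[0] = (unsigned char)(0xE0u | (v >> 12));
    out[1] = (unsigned char)(0x80u | ((v >> 6) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | (v & 0x3Fu));
}

/* Hypothetical helper: number of bytes standard UTF-8 needs for a code point. */
static size_t utf8_length(unsigned long cp)
{
    if (cp < 0x80UL) return 1;
    if (cp < 0x800UL) return 2;
    if (cp < 0x10000UL) return 3;
    return 4;
}
```

For U+1F91D the high surrogate is 0xD83E, which encode_3byte turns into 0xED 0xA0 0xBE, the bytes Tcl 8.6 places in buf. utf8_length reports 4 for the same code point, which is why a 3-byte buffer cannot receive it as a single UTF-8 sequence.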
This document has been placed in the public domain.