TIP 622: Full Unicode for Tcl 8.7

    Author:        Jan Nijtmans <[email protected]>
    State:         Draft
    Type:          Project
    Vote:          In progress
    Created:       23-Mar-2022
    Tcl-Version:   8.7
    Keywords:      Tcl
    Tcl-Branch:    full-utf-for-87

Abstract

Although Tcl 8.7 understands quite a lot of Unicode, one thing it still cannot do is count a character outside the Basic Multilingual Plane as a single character:

$ tclsh8.7
% string length 🤝
2

The reason is that, internally, sizeof(Tcl_UniChar) == 2, and this is visible throughout the Tcl C API. "string length" counts UTF-16 code units, not Unicode characters. Changing that (by defining TCL_UTF_MAX = 4) would change the behavior of the C API. For example:

    Tcl_Obj *obj = Tcl_NewStringObj("🤝x", -1);
    int size = Tcl_GetCharLength(obj);
    Tcl_UniChar *uniCharString = Tcl_GetUnicode(obj);
    Tcl_Obj *partOfTheString = Tcl_GetRange(obj, 2, 2);
In Tcl 8.6, the above example gives size = 3, uniCharString points to an array of 16-bit values, and partOfTheString contains the value "x". We cannot change that in Tcl 8.7: extensions that depend on this behavior and were compiled against Tcl 8.6 headers would behave differently when loaded into Tcl 8.7. The C API must stay binary compatible.
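
The difference between the two ways of counting can be illustrated without Tcl at all. The following is a minimal sketch, assuming well-formed UTF-8 input; the function names count_code_points and count_utf16_units are hypothetical and not part of the Tcl API.

```c
#include <stddef.h>

/* Count Unicode code points in a well-formed UTF-8 string: every byte
 * that is not a continuation byte (10xxxxxx) starts a new code point. */
size_t count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* Count UTF-16 code units for the same string: a 4-byte UTF-8 sequence
 * (lead byte 11110xxx) encodes a code point above U+FFFF, which needs
 * a surrogate pair, i.e. two code units. */
size_t count_utf16_units(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if ((c & 0xC0) != 0x80) {
            n += ((c & 0xF8) == 0xF0) ? 2 : 1;
        }
    }
    return n;
}
```

For the UTF-8 bytes of "🤝x" (F0 9F A4 9D 78), count_utf16_units yields 3, matching the Tcl 8.6 result described above, while count_code_points yields 2.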

This TIP proposes to set TCL_UTF_MAX=4 internally and to create a UTF-16 compatibility layer of stub entries so that extensions are not affected. This compatibility layer will be implemented only for Tcl 8.7; it will not be forward-merged to Tcl 9.0.

Specification

In tcl.h, determine TCL_UTF_MAX as follows:

    #ifndef TCL_UTF_MAX
    #   ifdef BUILD_tcl
    #	define TCL_UTF_MAX		4
    #   else
    #	define TCL_UTF_MAX		3
    #   endif
    #endif

This means that Tcl itself is built using UTF-32 internally. A new set of stub entries is created for 5 functions:

    int TclNumUtfChars(const char *, int)
    int TclGetCharLength(Tcl_Obj *)
    const char *TclUtfAtIndex(const char *, int)
    Tcl_Obj *TclGetRange(Tcl_Obj *, int, int)
    int TclGetUniChar(Tcl_Obj *, int)
Those 5 functions are used everywhere in Tcl, and they count in UTF-32. So TclNumUtfChars("🤝", -1) will return 1, not 2 as Tcl_NumUtfChars does in Tcl 8.6. Extensions using Tcl_NumUtfChars/Tcl_GetCharLength/Tcl_UtfAtIndex will continue to use the original functions, which count UTF-16 code units.

Extensions that want to use the new UTF-32 functions can define TCL_UTF_MAX=4 before including tcl.h. Then Tcl_NumUtfChars/Tcl_GetCharLength/Tcl_UtfAtIndex/Tcl_GetRange/Tcl_GetUniChar will be mapped to TclNumUtfChars/TclGetCharLength/TclUtfAtIndex/TclGetRange/TclGetUniChar.
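
The remapping idea can be sketched with a plain preprocessor switch. This is only an illustration under stated assumptions: DEMO_UTF_MAX, Demo_NumUtfChars and the two demo_num_chars_* functions are hypothetical stand-ins, and the actual mapping in Tcl goes through the stub table rather than through direct function names.

```c
/* Hypothetical stand-in for TCL_UTF_MAX. An extension would define it
 * before including the header to opt in to the UTF-32 variants. */
#ifndef DEMO_UTF_MAX
#   define DEMO_UTF_MAX 3
#endif

/* Legacy variant: counts UTF-16 code units in well-formed UTF-8. */
int demo_num_chars_utf16(const char *s)
{
    int n = 0;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if ((c & 0xC0) != 0x80) {
            n += ((c & 0xF8) == 0xF0) ? 2 : 1;
        }
    }
    return n;
}

/* New variant: counts Unicode code points in well-formed UTF-8. */
int demo_num_chars_utf32(const char *s)
{
    int n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* The public name resolves to one of the two variants at compile time. */
#if DEMO_UTF_MAX > 3
#   define Demo_NumUtfChars demo_num_chars_utf32
#else
#   define Demo_NumUtfChars demo_num_chars_utf16
#endif
```

Compiled as is, Demo_NumUtfChars resolves to the UTF-16 variant. Compiled with -DDEMO_UTF_MAX=4, the same source code calls the UTF-32 variant without any change at the call sites.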

Also, the following 3 functions which were deprecated in Tcl 8.7 (because they don't work well with UTF-32) are added to this compatibility layer:

    int Tcl_UniCharNcmp(const Tcl_UniChar *, const Tcl_UniChar *, unsigned long);
    int Tcl_UniCharNcasecmp(const Tcl_UniChar *, const Tcl_UniChar *, unsigned long);
    int Tcl_UniCharCaseMatch(const Tcl_UniChar *, const Tcl_UniChar *, int);
Those 3 functions are still deprecated (see TIP #542), but they are implemented in the UTF-16 compatibility layer for Tcl 8.7. In Tcl 9.0, they are gone.

In addition, the "string" objType is renamed to "utf32string", and a new "string" objType is implemented which uses UTF-16 code units. This objType is used in the compatibility layer of the following 5 functions:

    Tcl_UniChar *Tcl_GetUnicode(Tcl_Obj *);
    Tcl_UniChar *Tcl_GetUnicodeFromObj(Tcl_Obj *, int *);
    Tcl_Obj *Tcl_NewUnicodeObj(Tcl_UniChar *, int);
    void Tcl_SetUnicodeObj(Tcl_Obj *, Tcl_UniChar *, int);
    void Tcl_AppendUnicodeToObj(Tcl_Obj *, const Tcl_UniChar *, int);
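
A layer that hands UTF-16 data to extensions while the core stores UTF-32 has to split code points above U+FFFF into surrogate pairs. The following is a minimal sketch of that conversion for a single code point. The name utf16_encode is hypothetical, and this is not taken from the branch implementation.

```c
#include <stdint.h>

/* Encode one code point as UTF-16.
 * Returns the number of 16-bit code units written to out (1 or 2),
 * or 0 if cp is above U+10FFFF. Values in the surrogate range
 * U+D800..U+DFFF are passed through unchanged as a single unit. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

For U+1F91D (🤝) this produces the two code units 0xD83E and 0xDD1D, which is why a UTF-16 based length reports 2 for that character.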

If Tcl is compiled with -DTCL_NO_DEPRECATED, the UTF-16 compatibility layer is removed. This is meant to verify that the compatibility layer is not used anywhere internally. It also means that extensions using any of the above APIs will panic. Compiling the extension with -DTCL_UTF_MAX=4 will make the extension work again, but this is only meant for testing, not for production.

Finally, this TIP proposes to undo the deprecation of Tcl_AppendUnicodeToObj. Although the deprecation was proposed in TIP #542, it turned out that this function could not really be removed; it just became an internal stub function in Tcl 9.0. Therefore, there is no burden in exposing it again.

Caveat

Since TCL_UTF_MAX is raised internally from 3 to 4, the behavior of the encoding/decoding functions changes. For example, consider the following code:

    Tcl_EncodingState state;
    int read;
    Tcl_Encoding encoding = Tcl_GetEncoding(interp, "utf-8");
    char buf[TCL_UTF_MAX] = "";
    int result = Tcl_ExternalToUtf(interp, encoding, "🤝",
	    4, flags, &state, buf,
	    sizeof(buf), &read, NULL, NULL);
In Tcl 8.6, after this call, buf is filled with the bytes 0xED 0xA0 0xBE, which is the CESU-8 representation of a high surrogate. Tcl_ExternalToUtf in Tcl 8.6 is guaranteed to produce some output if the buffer provided has at least 3 bytes. In Tcl 8.7, buffers used for Tcl_ExternalToUtf or Tcl_UtfToExternal need at least 4 bytes, otherwise 4-byte UTF-8 sequences cannot be handled. Therefore, in Tcl 8.7, the above example produces no output (read = 0), since the buffer cannot hold even a single Unicode character.
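
The buffer-size arithmetic behind this caveat can be sketched independently of Tcl. The helpers utf8_len and utf8_put below are hypothetical names, not Tcl functions. They use the generic UTF-8 bit layout, so a lone surrogate value is written as 3 bytes, as in CESU-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes needed to store cp using the UTF-8 bit layout
 * (1 to 4), or 0 if cp is above U+10FFFF. */
size_t utf8_len(uint32_t cp)
{
    if (cp < 0x80)      return 1;
    if (cp < 0x800)     return 2;
    if (cp < 0x10000)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Write cp to dst if it fits in dstLen bytes.
 * Returns the number of bytes written, or 0 if cp is out of range
 * or the buffer is too small (nothing is written in that case). */
size_t utf8_put(uint32_t cp, unsigned char *dst, size_t dstLen)
{
    size_t len = utf8_len(cp);
    if (len == 0 || len > dstLen) {
        return 0;
    }
    switch (len) {
    case 1:
        dst[0] = (unsigned char)cp;
        break;
    case 2:
        dst[0] = (unsigned char)(0xC0 | (cp >> 6));
        dst[1] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    case 3:
        dst[0] = (unsigned char)(0xE0 | (cp >> 12));
        dst[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    default:
        dst[0] = (unsigned char)(0xF0 | (cp >> 18));
        dst[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        dst[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        dst[3] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    }
    return len;
}
```

With a 3-byte buffer, utf8_put(0x1F91D, ...) writes nothing, whereas the high surrogate 0xD83E fits and yields the bytes 0xED 0xA0 0xBE mentioned above. With a 4-byte buffer, U+1F91D is written as 0xF0 0x9F 0xA4 0x9D.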

Implementation

See branch full-utf-for-87

Copyright

This document has been placed in the public domain.