Tcl Source Code

UTF-8 String

Source Files

Public Interface

Private Interface

Directly Depends On Public Interface

Directly Depends On Private Interface of

Discussion

This module provides a set of utility functions that operate on Tcl_UniChar arrays and on char arrays whose content is in the internal encoding of Tcl 8. The documentation calls this encoding "Utf", but it is more precisely Modified UTF-8 without surrogate pair support. Likewise, the documentation refers to Tcl_UniChar arrays as "Unicode strings", though that term lacks precision: the storage in these arrays is UCS-2. Many of the functions expose the case and character class categorizations defined by Unicode.
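To make the encoding concrete, here is a minimal sketch of how one 16-bit UCS-2 unit maps to Modified UTF-8. The helper name sketch_ucs2_to_mutf8 is hypothetical and is not one of the module's routines; it only illustrates the byte layout, including the two-byte form that Modified UTF-8 uses for U+0000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Tcl): encode one 16-bit code unit as
 * Modified UTF-8. Writes 1 to 3 bytes into buf and returns the count. */
size_t sketch_ucs2_to_mutf8(uint16_t ch, unsigned char *buf)
{
    assert(buf != NULL);
    if (ch >= 0x0001 && ch <= 0x007F) {
        /* ASCII other than NUL: one byte, unchanged. */
        buf[0] = (unsigned char)ch;
        return 1;
    }
    if (ch <= 0x07FF) {
        /* Two bytes. This branch also handles U+0000, which Modified
         * UTF-8 stores as C0 80 so that no encoded byte is ever zero. */
        buf[0] = (unsigned char)(0xC0 | (ch >> 6));
        buf[1] = (unsigned char)(0x80 | (ch & 0x3F));
        return 2;
    }
    /* Rest of the BMP: three bytes. */
    buf[0] = (unsigned char)(0xE0 | (ch >> 12));
    buf[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
    buf[2] = (unsigned char)(0x80 | (ch & 0x3F));
    return 3;
}
```

Because the input is a single 16-bit unit, the output never exceeds three bytes, which matches the BMP-only limitation discussed further down.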

The tclWin32Dll.c file contains other routines not mentioned here, but they do not appear to belong in this functional area. The Tcl_Win*To* routines that are mentioned here appear to be legacy routines needed only on Windows platforms we no longer support. They are nothing more than typecasts applied to other public routines.

The dependence on Parsing and Evaluation internals is only in Tcl_UtfBackslash, which arguably belongs in parsing instead, and in the use of TclIsSpaceProc, which could easily be avoided. The dependence on Dynamic Strings is only in the interfaces that export to Tcl_DString, which is also easily remedied. This module is well within reach of being entirely self-contained.

The three private routines are likely good candidates for exporting. They have callers. In some cases perhaps they are preferred replacements for something already public.

The key flaw here is that UCS-2 fails to encode the complete set of Unicode codepoints. It encodes only the "Basic Multilingual Plane" (BMP). One binary compatible way to remedy that flaw is to convert from UCS-2 to UTF-16 in the same storage arrays. TIP 389 points in that direction.
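The difference between UCS-2 and UTF-16 can be sketched as follows. The helper sketch_encode_utf16 is hypothetical, not a Tcl routine: a BMP codepoint still occupies one 16-bit unit, while a codepoint above U+FFFF is split into a surrogate pair stored in two adjacent units of the same array type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Tcl): encode one Unicode codepoint as
 * UTF-16. Returns the number of 16-bit units written (1 or 2).
 * It does not check whether a BMP input is itself a surrogate value. */
size_t sketch_encode_utf16(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL && cp <= 0x10FFFF);
    if (cp < 0x10000) {
        /* BMP: identical to the UCS-2 representation. */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Supplementary planes: 20 bits split into two 10-bit halves. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate  */
    return 2;
}
```

For example, U+20AC stays a single unit 0x20AC, while U+1F600 becomes the pair 0xD83D 0xDE00.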

One problem with that answer is that the key reason UCS-2 is used in the first place is that it is a fixed-width encoding, which allows more efficient string processing when finding the character at an index, extracting and substituting substrings, and computing lengths. (It is also used to take in UTF-16 data from external sources efficiently and losslessly, notably command line data on Windows.) UTF-16, by contrast, is a variable-length encoding, so the advantage for those operations is at least diminished, if not lost. If we must manage a variable-length encoding anyway, we might as well stick with a single encoding for the simplicity that would bring. To the extent that happens, many of the routines here would become deprecated legacy support routines.
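The cost difference can be illustrated with a small sketch. Both function names are hypothetical and neither is a Tcl routine. With fixed-width storage, the character at an index is a single array access. With a variable-length encoding, the string has to be scanned from the start. The scan below assumes well-formed UTF-8 input.

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed-width storage: character i is element i, found in O(1). */
uint16_t sketch_ucs2_at(const uint16_t *s, size_t i)
{
    return s[i];
}

/* Variable-width storage: find the byte offset of character `index` in a
 * UTF-8 buffer of `len` bytes by scanning, O(n). Returns (size_t)-1 if
 * the string holds fewer than index + 1 characters. */
size_t sketch_utf8_offset(const char *s, size_t len, size_t index)
{
    size_t chars = 0;
    for (size_t off = 0; off < len; off++) {
        /* Bytes of the form 10xxxxxx continue a character; every other
         * byte starts a new one. */
        if (((unsigned char)s[off] & 0xC0) != 0x80) {
            if (chars == index) {
                return off;
            }
            chars++;
        }
    }
    return (size_t)-1;
}
```

The same scan is needed to compute a character count, which is why length and substring operations lose their constant-time behavior once the encoding is variable-length.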

There is no fully specified answer for how to implement efficiently the string operations currently performed on Tcl_UniChar arrays while keeping the data in Modified UTF-8. A "rope" data structure may be helpful, as may other indexing systems that mark out substrings to speed up random access to arbitrary indices.
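One possible shape for such an indexing system, offered only as an illustration and not as a proposed design, is a table of checkpoints that records the byte offset of every Nth character. A lookup then jumps to the nearest checkpoint and scans at most N - 1 characters. All names and the constants SKETCH_STRIDE and SKETCH_MAX_MARKS are hypothetical, and the input is assumed to be well-formed UTF-8.

```c
#include <stddef.h>

#define SKETCH_STRIDE 4      /* characters between checkpoints (assumed) */
#define SKETCH_MAX_MARKS 64  /* fixed table size to keep the sketch simple */

typedef struct {
    const char *bytes;
    size_t len;      /* length in bytes */
    size_t nchars;   /* length in characters */
    size_t nmarks;   /* checkpoints recorded */
    size_t marks[SKETCH_MAX_MARKS]; /* marks[k]: offset of char k*STRIDE */
} SketchIndex;

/* One full pass over the bytes, recording a checkpoint at every
 * SKETCH_STRIDE-th character and counting characters. */
void sketch_index_build(SketchIndex *ix, const char *s, size_t len)
{
    ix->bytes = s;
    ix->len = len;
    ix->nchars = 0;
    ix->nmarks = 0;
    for (size_t off = 0; off < len; off++) {
        if (((unsigned char)s[off] & 0xC0) == 0x80) {
            continue; /* continuation byte */
        }
        if (ix->nchars % SKETCH_STRIDE == 0 && ix->nmarks < SKETCH_MAX_MARKS) {
            ix->marks[ix->nmarks++] = off;
        }
        ix->nchars++;
    }
}

/* Byte offset of character `index`, or (size_t)-1 if out of range. */
size_t sketch_index_offset(const SketchIndex *ix, size_t index)
{
    if (index >= ix->nchars) {
        return (size_t)-1;
    }
    size_t k = index / SKETCH_STRIDE;
    if (k >= ix->nmarks) {
        k = ix->nmarks - 1; /* table was full: start from the last mark */
    }
    size_t chars = k * SKETCH_STRIDE;
    for (size_t off = ix->marks[k]; off < ix->len; off++) {
        if (((unsigned char)ix->bytes[off] & 0xC0) == 0x80) {
            continue;
        }
        if (chars == index) {
            return off;
        }
        chars++;
    }
    return (size_t)-1;
}
```

The trade-off in this sketch is extra memory and a rebuild whenever the string changes, in exchange for bounded scan lengths. A rope would make a different trade-off and is not shown here.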

Unicode normalization forms have not been addressed at all in Tcl's Unicode string handling. In particular, canonically equivalent Unicode strings do not test as equal according to the string equal command. These complications also undermine the apparent efficiency value of a fixed-width encoding, decreasing the odds that Tcl_UniChar arrays will play an important role going forward.
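The normalization issue can be seen in a small sketch of a unit-by-unit comparison, which is the kind of comparison that ignores normalization. The function name is hypothetical and this is not how string equal is implemented. The precomposed form of "é" is the single unit U+00E9, while the canonically equivalent decomposed form is U+0065 followed by the combining accent U+0301.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of Tcl): compare two arrays of 16-bit
 * units for exact equality, with no normalization step. */
int sketch_units_equal(const uint16_t *a, size_t alen,
                       const uint16_t *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen * sizeof *a) == 0;
}
```

Comparing {0x00E9} against {0x0065, 0x0301} with this function reports them as different, and the two arrays do not even have the same length. A fixed-width array therefore gives constant-time access to code units, not to what a reader would call characters.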

The current BMP limitation of Tcl strings has another consequence for Tcl's UTF-8 handling. The longest byte sequences decoded into codepoints have length 3. Covering the entire set of Unicode codepoints requires supporting UTF-8 sequences of up to 4 bytes (#define TCL_UTF_MAX 4).
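The 3-byte versus 4-byte boundary can be shown with a sketch of a standard UTF-8 encoder. The helper sketch_utf8_encode is hypothetical and is not Tcl's encoder. Unlike Modified UTF-8, it encodes U+0000 as a single zero byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Tcl): encode one codepoint as standard
 * UTF-8. buf must have room for 4 bytes. Returns the bytes written. */
size_t sketch_utf8_encode(uint32_t cp, unsigned char *buf)
{
    assert(buf != NULL && cp <= 0x10FFFF);
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        /* Everything in the BMP fits in at most 3 bytes. */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* Codepoints above U+FFFF need a fourth byte. */
    buf[0] = (unsigned char)(0xF0 | (cp >> 18));
    buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

U+FFFF, the last BMP codepoint, encodes as the three bytes EF BF BF, while U+10000 already requires F0 90 80 80.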