Source Files

Public Interface

Tcl_UniCharToUtf
Tcl_UniCharToUtfDString
Tcl_UtfToUniChar
Tcl_UtfToUniCharDString
Tcl_UtfCharComplete
Tcl_NumUtfChars
Tcl_UtfFindFirst
Tcl_UtfFindLast
Tcl_UtfNext
Tcl_UtfPrev
Tcl_UniCharAtIndex
Tcl_UtfAtIndex
Tcl_UtfBackslash
Tcl_UtfToUpper
Tcl_UtfToLower
Tcl_UtfToTitle
Tcl_UtfNcmp
Tcl_UtfNcasecmp
Tcl_UniCharToUpper
Tcl_UniCharToLower
Tcl_UniCharToTitle
Tcl_UniCharLen
Tcl_UniCharNcmp
Tcl_UniCharNcasecmp
Tcl_UniCharIsAlnum
Tcl_UniCharIsAlpha
Tcl_UniCharIsControl
Tcl_UniCharIsDigit
Tcl_UniCharIsGraph
Tcl_UniCharIsLower
Tcl_UniCharIsPrint
Tcl_UniCharIsPunct
Tcl_UniCharIsSpace
Tcl_UniCharIsUpper
Tcl_UniCharIsWordChar
Tcl_UniCharCaseMatch
Tcl_WinUtfToTChar
Tcl_WinTCharToUtf

Private Interface

TclUtfCasecmp
TclUniCharMatch
TclpUtfNcmp2

Directly Depends On Public Interface

Dynamic Strings

Directly Depends On Private Interface of

Parsing and Evaluation

Discussion

This module provides utility functions that operate on Tcl_UniChar arrays and on char arrays holding text in the internal encoding of Tcl 8. That encoding is referred to as "Utf", but is more precisely Modified UTF-8 without surrogate pair support. Likewise, the documentation refers to Tcl_UniChar arrays as "Unicode strings", though that term lacks precision: the storage in these arrays is UCS-2. Many of the functions expose the case mappings and character classes defined by Unicode.
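To make the "Modified UTF-8" distinction concrete, here is a minimal sketch of encoding one 16-bit UCS-2 value. The helper name SketchUcs2ToModifiedUtf8 is hypothetical and not part of Tcl. The sketch assumes the usual meaning of Modified UTF-8, in which U+0000 is written as the two bytes 0xC0 0x80 so that a zero byte never appears inside string data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a Tcl routine): encode one 16-bit UCS-2 value
 * as Modified UTF-8. Writes 1 to 3 bytes into buf and returns the count.
 * Surrogate values get no special treatment; they are encoded like any
 * other 16-bit value.
 */
size_t
SketchUcs2ToModifiedUtf8(uint16_t ch, unsigned char *buf)
{
    assert(buf != NULL);

    if (ch != 0 && ch < 0x80) {
        /* Plain ASCII other than NUL: one byte. */
        buf[0] = (unsigned char) ch;
        return 1;
    }
    if (ch < 0x800) {
        /* Also covers U+0000, which becomes the pair 0xC0 0x80. */
        buf[0] = (unsigned char) (0xC0 | (ch >> 6));
        buf[1] = (unsigned char) (0x80 | (ch & 0x3F));
        return 2;
    }
    /* Everything else in the 16-bit range: three bytes. */
    buf[0] = (unsigned char) (0xE0 | (ch >> 12));
    buf[1] = (unsigned char) (0x80 | ((ch >> 6) & 0x3F));
    buf[2] = (unsigned char) (0x80 | (ch & 0x3F));
    return 3;
}
```

With this sketch, 0x0041 yields the single byte 0x41, while 0x0000 yields 0xC0 0x80 rather than a lone zero byte.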

The tclWin32Dll.c file contains other routines not mentioned here, but they do not appear to belong in this functional area. The Tcl_Win*To* routines listed here appear to be legacy routines needed only for Windows platforms we no longer support. They are nothing more than typecasts applied to other public routines.

The dependence on Parsing and Evaluation internals arises in only two places: Tcl_UtfBackslash, which arguably belongs in parsing instead, and a use of TclIsSpaceProc, which could easily be avoided. The dependence on Dynamic Strings is only in the interfaces that export results to a Tcl_DString, which is also easily remedied. This module is well within reach of being entirely self-contained.

The three private routines are likely good candidates for exporting. They have callers, and in some cases they may be preferable replacements for routines that are already public.

The key flaw here is that UCS-2 cannot encode the complete set of Unicode code points. It encodes only the "Basic Multilingual Plane" (BMP). One binary-compatible way to remedy that flaw is to convert from UCS-2 to UTF-16 in the same storage arrays. TIP 389 points in that direction.
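As an illustration of why UTF-16 can reuse the same 16-bit storage, the following sketch encodes a single code point using the standard UTF-16 rules. The helper name SketchEncodeUtf16 is hypothetical; this is not the TIP 389 implementation, only the general encoding rule.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a Tcl routine): encode code point cp as UTF-16.
 * Returns the number of 16-bit units written to out: 1 for a BMP code
 * point, 2 (a surrogate pair) for anything above U+FFFF.
 */
int
SketchEncodeUtf16(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);
    assert(cp < 0xD800 || cp > 0xDFFF);  /* surrogate values are not characters */

    if (cp < 0x10000) {
        /* BMP: identical to the UCS-2 representation. */
        out[0] = (uint16_t) cp;
        return 1;
    }
    /* Beyond the BMP: split the 20-bit offset across two units. */
    cp -= 0x10000;
    out[0] = (uint16_t) (0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For example, U+20AC stays the single unit 0x20AC, while U+1F600 becomes the pair 0xD83D 0xDE00. Code that treats each array element as one character would count the latter as two.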

One problem with that answer is that the key reason UCS-2 is used in the first place is that it is a fixed-width encoding, which allows more efficient string operations: finding the character at an index, extracting and substituting substrings, and computing lengths. (It is also used to take in UTF-16 data from external sources efficiently and losslessly, notably command line data on Windows.) UTF-16, by contrast, is a variable-length encoding, so the advantage for those operations is at least diminished, if not lost. If we must manage a variable-length encoding anyway, we might as well keep a single encoding for the simplicity that implies. To the extent that happens, many of the routines here would become deprecated legacy support routines.
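The cost difference can be sketched in a few lines. Both helper names below are hypothetical and are not Tcl routines. The second helper assumes well-formed, NUL-terminated UTF-8 input and does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed width: character i is simply element i, found in constant time. */
uint16_t
SketchUcs2CharAt(const uint16_t *s, size_t i)
{
    assert(s != NULL);
    return s[i];
}

/*
 * Variable width: to find character i we must step over every character
 * before it. Returns the byte offset of character i, or the offset of
 * the terminating NUL if the string has fewer than i characters.
 */
size_t
SketchUtf8OffsetOfChar(const unsigned char *s, size_t i)
{
    size_t off = 0;

    assert(s != NULL);
    while (i > 0 && s[off] != '\0') {
        off++;                              /* lead byte */
        while ((s[off] & 0xC0) == 0x80) {
            off++;                          /* continuation bytes */
        }
        i--;
    }
    return off;
}
```

The first lookup costs the same for any index, while the second grows linearly with the index. That difference is the efficiency argument for the fixed-width form.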

There is no fully specified answer for how to implement efficiently, while keeping the data in Modified UTF-8, the string operations currently performed on Tcl_UniChar arrays. The "rope" data structure may help, as may other indexing schemes that mark out substrings to speed up random access to arbitrary indices.
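One possible indexing scheme, offered only as a sketch and not as a proposal taken from Tcl, records the byte offset of every Nth character. A lookup then jumps to the nearest preceding checkpoint and scans at most N-1 characters. All names, the stride, and the fixed capacity below are illustrative assumptions, and the input is assumed to be well-formed, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_STRIDE    4   /* characters between checkpoints (illustrative) */
#define SKETCH_MAX_MARKS 64  /* fixed capacity keeps the sketch allocation-free */

typedef struct {
    const unsigned char *bytes;      /* the indexed string, not copied */
    size_t numChars;                 /* total characters found */
    size_t numMarks;                 /* checkpoints recorded */
    size_t marks[SKETCH_MAX_MARKS];  /* marks[k]: offset of char k*STRIDE */
} SketchIndex;

/* Return the byte offset just past the character that starts at off. */
size_t
SketchNextChar(const unsigned char *s, size_t off)
{
    off++;
    while ((s[off] & 0xC0) == 0x80) {
        off++;
    }
    return off;
}

/* One linear pass over s records a checkpoint every SKETCH_STRIDE chars. */
void
SketchIndexBuild(SketchIndex *idx, const unsigned char *s)
{
    size_t off = 0, n = 0;

    assert(idx != NULL && s != NULL);
    idx->bytes = s;
    idx->numMarks = 0;
    while (s[off] != '\0') {
        if (n % SKETCH_STRIDE == 0) {
            assert(idx->numMarks < SKETCH_MAX_MARKS);
            idx->marks[idx->numMarks++] = off;
        }
        off = SketchNextChar(s, off);
        n++;
    }
    idx->numChars = n;
}

/* Byte offset of character i: jump to a checkpoint, then scan a bounded gap. */
size_t
SketchIndexOffset(const SketchIndex *idx, size_t i)
{
    size_t off, rem;

    assert(idx != NULL && i < idx->numChars);
    off = idx->marks[i / SKETCH_STRIDE];
    for (rem = i % SKETCH_STRIDE; rem > 0; rem--) {
        off = SketchNextChar(idx->bytes, off);
    }
    return off;
}
```

The trade-off is extra memory of roughly one offset per N characters, plus the cost of rebuilding or patching the table whenever the string is modified. A rope addresses the modification cost differently, by keeping such bookkeeping per substring node.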

Unicode normalization forms have not been addressed at all in Tcl's Unicode string handling. In particular, canonically equivalent Unicode strings do not test as equal according to the string equal command. These complications also undermine the apparent efficiency value of a fixed-width encoding, decreasing the odds that Tcl_UniChar will play an important role going forward.
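A small sketch shows the effect. The helper SketchUnitsEqual and the two arrays are hypothetical. The helper compares 16-bit arrays element by element with no normalization, which is the kind of comparison the paragraph above describes; it is not Tcl's actual string equal implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two canonically equivalent spellings of "e with acute accent". */
const uint16_t sketchComposed[]   = { 0x00E9 };          /* precomposed U+00E9 */
const uint16_t sketchDecomposed[] = { 0x0065, 0x0301 };  /* e + combining acute */

/*
 * Hypothetical helper (not a Tcl routine): element-by-element comparison
 * with no normalization step. Returns 1 if equal, 0 otherwise.
 */
int
SketchUnitsEqual(const uint16_t *a, size_t aLen,
                 const uint16_t *b, size_t bLen)
{
    size_t i;

    assert(a != NULL && b != NULL);
    if (aLen != bLen) {
        return 0;
    }
    for (i = 0; i < aLen; i++) {
        if (a[i] != b[i]) {
            return 0;
        }
    }
    return 1;
}
```

SketchUnitsEqual(sketchComposed, 1, sketchDecomposed, 2) returns 0 even though both arrays render as the same character. The decomposed form also occupies two array elements for one user-perceived character, so even a fixed-width array does not give a one-to-one mapping between indices and what a reader would call characters.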