Author:		Ashok P. Nadkarni <[email protected]>
State:		Draft
Type:		Project
Vote:		Pending
Created:	2025-06-24
Tcl-Version:	9.1
Tcl-Branch:	tip-726
Keywords:	strings Unicode

Abstract

This TIP proposes commands and a C API for normalization of Unicode character strings.

Specification

Commands

This TIP introduces a new command ensemble, unicode, with the following commands:

unicode tonfc ?-profile PROFILE? STRING
unicode tonfd ?-profile PROFILE? STRING
unicode tonfkc ?-profile PROFILE? STRING
unicode tonfkd ?-profile PROFILE? STRING

The tonfc, tonfd, tonfkc and tonfkd subcommands convert the passed STRING to the Unicode normalization forms Normalization Form C (NFC), Normalization Form D (NFD), Normalization Form KC (NFKC) and Normalization Form KD (NFKD) respectively. For the definition of these forms, see Section 3.11 in Chapter 3 of the Unicode standard.

The -profile option has the same semantics as that in Tcl's encoding command but will only accept strict (default) and replace as valid values.
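To see why normalization matters, consider the letter é. It can be represented as the single code point U+00E9 (its NFC form) or as U+0065 followed by the combining acute accent U+0301 (its NFD form). The two are canonically equivalent, yet they differ as code point sequences and as UTF-8 bytes, so a plain byte comparison treats them as different strings. The following is a minimal C sketch of that situation; the helpers count_code_points and same_bytes are hypothetical names used only for illustration and are not part of the proposed API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The same abstract character in two canonically equivalent encodings. */
static const char E_ACUTE_NFC[] = "\xC3\xA9";   /* U+00E9 */
static const char E_ACUTE_NFD[] = "e\xCC\x81";  /* U+0065 U+0301 */

/*
 * Hypothetical helper (not part of the proposed API): counts code points
 * in a well-formed UTF-8 string by counting the bytes that are not
 * continuation bytes (bit pattern 10xxxxxx).
 */
static size_t count_code_points(const char *utf8)
{
    size_t n = 0;

    assert(utf8 != NULL);
    for (; *utf8 != '\0'; utf8++) {
        if (((unsigned char)*utf8 & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* Byte-wise equality, which is what a naive string comparison performs. */
static int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here count_code_points(E_ACUTE_NFC) is 1, count_code_points(E_ACUTE_NFD) is 2, and same_bytes reports the two strings as different. Converting both to the same form first, for instance with unicode tonfc, would be expected to make them compare equal.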

C API

The corresponding C level API is defined as

typedef enum {
    TCL_NFC, TCL_NFD, TCL_NFKC, TCL_NFKD
} Tcl_UnicodeNormalizationForm;

const char *Tcl_UtfToNormalizedDString(
    Tcl_Interp *,
    const char *sourceString,
    Tcl_UnicodeNormalizationForm normalizationForm,
    int profile,
    Tcl_DString *dsPtr);

The sourceString operand must be in Tcl's internal UTF-8 format.

The normalizationForm argument specifies the Unicode normalization form. The profile argument must be one of TCL_ENCODING_PROFILE_STRICT or TCL_ENCODING_PROFILE_REPLACE and has the same semantics as in the encoding routines.

The normalized string is stored in dsPtr in Tcl's internal UTF-8 format. The function initializes *dsPtr. The caller must call Tcl_DStringFree on success.

On success, the function returns a pointer to the normalized string stored in dsPtr. On error, a NULL pointer is returned.
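The calling convention described above might be used as in the following sketch. It assumes a Tcl build from the tip-726 branch whose tcl.h declares Tcl_UtfToNormalizedDString as specified here, so it will not compile against a released Tcl. The wrapper name NormalizeToNfc is hypothetical. This TIP does not state whether an error message is left in the interpreter on failure, so the sketch only checks for a NULL return.

```c
#include <tcl.h>

/*
 * Hypothetical wrapper (illustration only): returns a new Tcl_Obj holding
 * the NFC form of the UTF-8 string src, or NULL if normalization fails.
 */
static Tcl_Obj *
NormalizeToNfc(Tcl_Interp *interp, const char *src)
{
    Tcl_DString ds;        /* Initialized by Tcl_UtfToNormalizedDString */
    Tcl_Obj *resultObj;

    if (Tcl_UtfToNormalizedDString(interp, src, TCL_NFC,
            TCL_ENCODING_PROFILE_STRICT, &ds) == NULL) {
        /* Per the text above, the DString is freed only on success. */
        return NULL;
    }

    /* Copy the normalized bytes into a Tcl value, then release the DString. */
    resultObj = Tcl_NewStringObj(Tcl_DStringValue(&ds),
            Tcl_DStringLength(&ds));
    Tcl_DStringFree(&ds);
    return resultObj;
}
```

Passing TCL_ENCODING_PROFILE_REPLACE instead of TCL_ENCODING_PROFILE_STRICT would select the replace profile; the other normalization forms are selected by passing TCL_NFD, TCL_NFKC or TCL_NFKD.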

Implementation library and license

The open source utf8proc library forms the core of the implementation. This will be statically linked into Tcl. The open source license for the library, reproduced below, is considered compatible with Tcl's standard license without imposing additional terms.

Discussion

Alternative syntax

Some variations of the syntax were discussed.

Suggestion

To avoid name clashes with existing extensions, the unicode ensemble may be placed under the tcl namespace.

Suggestion

Instead of multiple commands, a single normalize command was initially proposed.

unicode normalize ?-mode MODE? ?-profile PROFILE? STRING

However, the multiple-command syntax was preferred for succinctness.

Suggestion

Several people have also suggested making the commands part of the string ensemble, for example string normalize or string tonfc. This is a reasonable option, but the author's (mild) preference for a separate command ensemble stems from a desire to separate commands whose domain of functionality is strictly Unicode code points, i.e. the string command, from commands that understand Unicode abstract characters, i.e. the proposed unicode command. In future proposals, the unicode command would be expanded with additional functionality related to collation, case folding, glyph boundaries etc. (In a sense, the binary ensemble is not part of string for similar reasons: it covers a different conceptual domain.)

Having said that, if the majority feel string would be appropriate to host the new commands, the author has no strong objection.

Implementation choices

This section is really the main purpose of this TIP. Several possible implementation strategies are listed here along with their pros and cons. No specific recommendation is made at this time, pending discussion of this TIP. Suggestions are welcome for alternatives not mentioned below.

Criteria for selection include

Supported features. While this TIP focuses only on normalization, future Unicode-related enhancements such as collation, boundary rules, paragraph layouts etc. may also deserve some consideration.
Cross-platform availability and compatibility. Any core Tcl features need to be available on all supported platforms and ideally produce the same output.
Engineering effort and ease of integration including future maintenance.
Performance and code size considerations.

Implement enhancements to Tcl's Unicode modules

Tcl already has the data tables for some Unicode character classes such as numerics, whitespace etc. but, as far as I know, does not include the information needed for normalization. Further, the rules and algorithms for normalization are not trivial, to put it mildly, so unless someone commits to implementing and maintaining them in the future, I do not consider this a viable option.

Bind to the ICU library

The ICU C library is the most widely used cross-platform library as well as the one with the most features. Bindings for normalization already exist in the tcl::unsupported namespace, and some features such as word breaks are already being used by Tk if available, so minimal (but not zero) further effort is required for inclusion.

There are some potential drawbacks to ICU:

The implementation of ICU itself (not the bindings) is complex and it is unlikely to be suitable for direct inclusion into the Tcl code base. The implication is that Tcl has to rely on the presence of the library on the target platforms. While most major platforms have ICU installed, this is not universally true. For example, older editions of Windows 10 do not. Certain embedded environments etc. may not be capable of supporting ICU at all.
A lesser issue is that the layout of ICU binaries differs by platform and ICU version. The current implementation of Tcl ICU bindings (which stems from Tk) does a search at runtime and then dynamically looks up the function entry points. Some work is required in the build system to fix this. Quite possibly not much work, but the author views anything to do with autoconf with trepidation.
An even lesser issue is that from a performance perspective, ICU interfaces are primarily UTF-16 and calls require a translation step from Tcl 9's internal UTF-8 and UTF-32 formats.

Bind to the ICU4X library

The ICU4X library is a newer reimplementation of most features of ICU (notably not the encoders). It is (purportedly) smaller, faster and more modular than ICU. The implementation is written in Rust but as a shared library it is callable from C and the author has a private binding from Tcl as a proof of concept.

The potential drawbacks are:

Implementation in Rust means it is unlikely to ever ship as part of Tcl and will have to rely on availability through package managers on target platforms. For Windows, we will probably have to ship pre-built binaries.
Being relatively new, it is not as widely available as ICU as part of package managers.

Statically link to the utf8proc library

The utf8proc library is a UTF-8 processing library used in the Julia language that provides a very limited set of functions compared to ICU or ICU4X. Its primary advantages over ICU or ICU4X are its relatively small size (~360KB binary) and its ease of integration into Tcl because of its MIT license and C99 implementation. It allows provision of cross-platform normalization functions independent of the presence of system libraries. A minor benefit is that its API also uses a 32-bit type to represent code points, as in Tcl.

The main drawback is that it offers very limited functionality by design.

A Tcl binding is available as an extension.

Proposed implementation

For the 9.1 release, it is proposed to use the utf8proc library to add support for normalization, for the following reasons:

It is easily integrated within the Tcl core.
It does not rely on any system libraries and will be available on all platforms that can run Tcl.
Its limited functionality is irrelevant at this time because adding support for ICU features it does not support will require far more extensive changes to Tcl and Tk. If those features are ever added to Tcl/Tk, the library can be trivially removed.

Note that the current use of ICU within tcl::unsupported::icu will continue, as it provides other functionality such as sniffing of encodings.

If the above is not acceptable or desirable for any reason, the second choice would be to promote the tcl::unsupported::icu normalize command as a supported command with the caveat that it will only be available on platforms where ICU is present.

Reliance on system libraries may mean older implementations and compatibility changes between versions, although behavior would then be consistent with other applications using the same library.

utf8proc library license

The license for the utf8proc library is reproduced below.

Start of utf8proc license

utf8proc license

utf8proc is a software package originally developed by Jan Behrens and the rest of the Public Software Group, who deserve nearly all of the credit for this library, that is now maintained by the Julia-language developers. Like the original utf8proc, whose copyright and license statements are reproduced below, all new work on the utf8proc library is licensed under the MIT "expat" license:

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Original utf8proc license

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

Unicode data license

This software contains data (utf8proc_data.c) derived from processing the Unicode data files. The following license applies to that data:

COPYRIGHT AND PERMISSION NOTICE

Permission is hereby granted, free of charge, to any person obtaining a copy of the Unicode data files and any associated documentation (the "Data Files") or Unicode software and any associated documentation (the "Software") to deal in the Data Files or Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Data Files or Software, and to permit persons to whom the Data Files or Software are furnished to do so, provided that (a) the above copyright notice(s) and this permission notice appear with all copies of the Data Files or Software, (b) both the above copyright notice(s) and this permission notice appear in associated documentation, and (c) there is clear notice in each modified Data File or in the Software as well as in the documentation associated with the Data File(s) or Software that the data or software has been modified.

THE DATA FILES AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THE DATA FILES OR SOFTWARE.

Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in these Data Files or Software without prior written authorization of the copyright holder.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and may be registered in some jurisdictions. All other trademarks and registered trademarks mentioned herein are the property of their respective owners.

End of utf8proc license

Copyright

This document has been placed in the public domain.

TIP 726: Commands for Unicode normalization