Author: Ashok P. Nadkarni <[email protected]>
State: Draft
Type: Project
Vote: Pending
Created: 2025-06-24
Tcl-Version: 9.1
Tcl-Branch: tip-726
Keywords: strings Unicode
Abstract
This TIP proposes commands and a C API for normalization of Unicode character strings.
Specification
Commands
This TIP introduces a new command ensemble, unicode, with the following commands:
unicode tonfc ?-profile PROFILE? STRING
unicode tonfd ?-profile PROFILE? STRING
unicode tonfkc ?-profile PROFILE? STRING
unicode tonfkd ?-profile PROFILE? STRING
The tonfc, tonfd, tonfkc and tonfkd subcommands convert the passed STRING to the Unicode normalization forms Normalization Form C (NFC), Normalization Form D (NFD), Normalization Form KC (NFKC) and Normalization Form KD (NFKD) respectively. For the definition of these forms, see Section 3.11 in Chapter 3 of the Unicode standard.
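As an illustration of why these forms matter, the following minimal C sketch shows that the same abstract character can have canonically equivalent encodings that differ byte for byte. It does not use the proposed API; the names kPrecomposed, kDecomposed, BytesEqual and DumpBytes are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* U+00E9 LATIN SMALL LETTER E WITH ACUTE as a single precomposed
 * code point. This is the NFC form of the character. */
const char kPrecomposed[] = "\xC3\xA9";

/* U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE
 * ACCENT. This is the NFD form of the same character. */
const char kDecomposed[] = "e\xCC\x81";

/* Returns 1 when the two UTF-8 strings are byte-for-byte identical. */
int BytesEqual(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}

/* Prints the bytes of a UTF-8 string in hexadecimal after a label. */
void DumpBytes(const char *label, const char *s)
{
    assert(label != NULL && s != NULL);
    printf("%s:", label);
    for (; *s != '\0'; s++) {
        printf(" %02X", (unsigned char)*s);
    }
    printf("\n");
}
```

Here BytesEqual(kPrecomposed, kDecomposed) returns 0 even though both strings render as the same character. Converting both operands to the same normalization form before comparing them removes this difference.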
The -profile option has the same semantics as that in Tcl's encoding command but will only accept strict (default) and replace as valid values.
C API
The corresponding C level API is defined as
typedef enum {
    TCL_NFC, TCL_NFD, TCL_NFKC, TCL_NFKD
} Tcl_UnicodeNormalizationForm;
const char *Tcl_UtfToNormalizedDString(
    Tcl_Interp *,
    const char *sourceString,
    Tcl_UnicodeNormalizationForm normalizationForm,
    int profile,
    Tcl_DString *dsPtr);
The sourceString operand must be in Tcl's internal UTF-8 format. The normalizationForm argument specifies the Unicode normalization form. The profile argument must be one of TCL_ENCODING_PROFILE_STRICT or TCL_ENCODING_PROFILE_REPLACE and has the same semantics as in the encoding routines.
The normalized string is stored in dsPtr in Tcl's internal UTF-8 format. The function initializes *dsPtr. On success, the caller must call Tcl_DStringFree on it once the result is no longer needed.
On success, the function returns a pointer to the normalized string stored in dsPtr. On error, a NULL pointer is returned.
Implementation library and license
The open source utf8proc library forms the core of the implementation. This will be statically linked into Tcl. The open source license for the library, reproduced below, is considered compatible with Tcl's standard license without imposing additional terms.
Discussion
Alternative syntax
Some variations of the syntax were discussed.
Suggestion
To avoid name clashes with existing extensions, the unicode ensemble may be placed under the tcl namespace.
Suggestion
Instead of multiple commands, a single normalize command was initially proposed.
unicode normalize ?-mode MODE? ?-profile PROFILE? STRING
However, the multiple command syntax was preferred for succinctness.
Suggestion
Several people have also suggested making the commands part of the string ensemble, for example string normalize or string tonfc etc. This is a reasonable option but the author's (mild) preference for a separate command ensemble stems from a desire to separate commands whose domain of functionality is strictly Unicode code points, i.e. the string command, from commands that understand Unicode abstract characters, i.e. the proposed unicode command. In future proposals, the command would be expanded with additional functionality related to collation, case folding, glyph boundaries etc. (In a sense the binary ensemble is not part of string for similar reasons -- different conceptual domain).
Having said that, if the majority feel the string ensemble would be appropriate to host the new commands, the author has no strong objection.
Implementation choices
This section is really the main purpose of this TIP. There are several possible implementation strategies that are listed here along with their pros and cons. No specific recommendation is made at this time pending discussion of this TIP. Suggestions welcome for alternatives not mentioned below.
Criteria for selection include
Supported features. While this TIP only focuses on normalization, some consideration might be due for future Unicode related enhancements such as collation, boundary rules, paragraph layouts etc.
Cross-platform availability and compatibility. Any core Tcl features need to be available on all supported platforms and ideally produce the same output.
Engineering effort and ease of integration including future maintenance.
Performance and code size considerations.
Implement enhancements to Tcl's Unicode modules.
Tcl already has the data tables for some Unicode character classes such as numerics, whitespace etc. but as far as I know does not include the information needed for normalization. Further, the rules and algorithms for normalization are not trivial, to put it mildly, so unless someone commits to implementing and maintaining them in the future, I do not consider this a viable option.
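To give a sense of what is involved, the sketch below implements only one small, purely arithmetic piece of canonical decomposition: splitting a precomposed Hangul syllable into its conjoining jamo, following the algorithm given in Chapter 3 of the Unicode standard. It is not taken from Tcl or from any of the libraries discussed here, and the function name HangulDecompose is hypothetical. A complete implementation would additionally need the decomposition mapping tables, canonical reordering by combining class, and recomposition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants from the Hangul syllable decomposition algorithm in
 * Chapter 3 of the Unicode standard. */
#define S_BASE  0xAC00u                 /* first precomposed syllable */
#define L_BASE  0x1100u                 /* first leading consonant */
#define V_BASE  0x1161u                 /* first vowel */
#define T_BASE  0x11A7u                 /* one before first trailing consonant */
#define L_COUNT 19u
#define V_COUNT 21u
#define T_COUNT 28u
#define N_COUNT (V_COUNT * T_COUNT)     /* 588 */
#define S_COUNT (L_COUNT * N_COUNT)     /* 11172 */

/*
 * Writes the canonical decomposition of a precomposed Hangul syllable
 * to out[] and returns the number of code points written (2 or 3).
 * Returns 0, leaving out[] untouched, if cp is not such a syllable.
 */
int HangulDecompose(uint32_t cp, uint32_t out[3])
{
    uint32_t sIndex, tIndex;

    assert(out != NULL);
    if (cp < S_BASE || cp >= S_BASE + S_COUNT) {
        return 0;
    }
    sIndex = cp - S_BASE;
    out[0] = L_BASE + sIndex / N_COUNT;              /* leading consonant */
    out[1] = V_BASE + (sIndex % N_COUNT) / T_COUNT;  /* vowel */
    tIndex = sIndex % T_COUNT;
    if (tIndex == 0) {
        return 2;                                    /* no trailing consonant */
    }
    out[2] = T_BASE + tIndex;                        /* trailing consonant */
    return 3;
}
```

For example, U+D55C decomposes into the three code points U+1112, U+1161 and U+11AB, while U+AC00 decomposes into only U+1100 and U+1161.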
Bind to the ICU library
The ICU C library is the most widely used cross-platform library as well as the one with the most features. Bindings for normalization already exist in the tcl::unsupported namespace and some features such as word breaks are already being used by Tk if available, so minimal (but not zero) further effort is required for inclusion.
There are some potential drawbacks to ICU:
The implementation of ICU itself (not the bindings) is complex and it is unlikely to be suitable for direct inclusion into the Tcl code base. The implication is that Tcl has to rely on the presence of the library on the target platforms. While most major platforms have ICU installed, this is not universally true. For example, older editions of Windows 10 do not. Certain embedded environments etc. may not be capable of supporting ICU at all.
A lesser issue is that the layout of ICU binaries differs by platform and ICU version. The current implementation of the Tcl ICU bindings (which stems from Tk) does a search at runtime and then dynamically looks up the function entry points. Some work is required in the build system to fix this. Quite possibly, not much work but the author views anything to do with autoconf with trepidation.
An even lesser issue is that from a performance perspective, ICU interfaces are primarily UTF-16 and calls require a translation step from Tcl 9's internal UTF-8 and UTF-32 formats.
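The translation step mentioned in the last point can be sketched as follows. This is a hand-written illustration of converting UTF-8 to UTF-16 code units, not the code used by Tcl or by the existing ICU bindings. The function name Utf8ToUtf16 is hypothetical, and the sketch assumes its input is well-formed, NUL-terminated UTF-8, so it performs no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Converts a NUL-terminated UTF-8 string to UTF-16 code units.
 * Returns the number of code units written to dst, or (size_t)-1 if
 * dst (capacity dstCap code units) is too small.
 * ASSUMPTION: src is well-formed UTF-8; malformed input is not detected.
 */
size_t Utf8ToUtf16(const char *src, uint16_t *dst, size_t dstCap)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*p != 0) {
        uint32_t cp;
        int extra;                      /* continuation bytes to read */

        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        p++;
        while (extra-- > 0) {
            cp = (cp << 6) | (uint32_t)(*p++ & 0x3F);
        }

        if (cp < 0x10000) {
            /* Code points in the BMP fit in one code unit. */
            if (n + 1 > dstCap) return (size_t)-1;
            dst[n++] = (uint16_t)cp;
        } else {
            /* Supplementary code points need a surrogate pair. */
            if (n + 2 > dstCap) return (size_t)-1;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return n;
}
```

Every call into a UTF-16 interface pays for a pass like this over the input, plus the reverse conversion on the result, which is the overhead the point above refers to.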
Bind to the ICU4X library
The ICU4X library is a newer reimplementation of most features of ICU (notably not the encoders). It is (purportedly) smaller, faster and more modular than ICU. The implementation is written in Rust but as a shared library it is callable from C and the author has a private binding from Tcl as a proof of concept.
The potential drawbacks are:
Implementation in Rust means it is unlikely to ever ship as part of Tcl and will have to rely on availability through package managers on target platforms. For Windows, we will probably have to ship pre-built binaries.
Being relatively new, it is not as widely available as ICU as part of package managers.
Statically link to the utf8proc library
The utf8proc library is a UTF-8 processing library used in the Julia language that provides a very limited set of functions compared to ICU or ICU4X. Its primary advantage over ICU or ICU4X is its small size in relative terms (~360KB binary) and ease of integration into Tcl because of its MIT license and C99 implementation. It allows provision of cross-platform normalization functions independent of the presence of system libraries. A minor benefit is that its API also uses a 32-bit type to represent code points, as Tcl does.
The main drawback is that it offers very limited functionality by design.
A Tcl binding is available as an extension.
Proposed implementation
For the 9.1 release, it is proposed to use the utf8proc library to add support for normalization.
- It is easily integrated within the Tcl core.
- It does not rely on any system libraries and will be available on all platforms that can run Tcl.
- Its limited functionality is irrelevant at this time because adding support for ICU features it does not support will require far more extensive changes to Tcl and Tk. If those features are ever added to Tcl/Tk, the library can be trivially removed.
Note that the current use of ICU within tcl::unsupported::icu will continue as it provides other functionality such as sniffing of encodings.
If the above is not acceptable or desirable for any reason, the second choice would be to promote the tcl::unsupported::icu normalize command as a supported command with the caveat that it will only be available on platforms where ICU is present.
Reliance on system libraries means Tcl may encounter older implementations and behavior that changes between library versions, but its results will be consistent with those of other applications on the same system.
utf8proc library license
The license for the utf8proc library is reproduced below.
Start of utf8proc license
utf8proc license
utf8proc is a software package originally developed by Jan Behrens and the rest of the Public Software Group, who deserve nearly all of the credit for this library, that is now maintained by the Julia-language developers. Like the original utf8proc, whose copyright and license statements are reproduced below, all new work on the utf8proc library is licensed under the MIT "expat" license:
Copyright © 2014-2021 by Steven G. Johnson, Jiahao Chen, Tony Kelman, Jonas Fonseca, and other contributors listed in the git history.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Original utf8proc license
Copyright (c) 2009, 2013 Public Software Group e. V., Berlin, Germany
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Unicode data license
This software contains data (utf8proc_data.c) derived from processing the Unicode data files. The following license applies to that data:
COPYRIGHT AND PERMISSION NOTICE
Copyright (c) 1991-2007 Unicode, Inc. All rights reserved. Distributed under the Terms of Use in http://www.unicode.org/copyright.html.
Permission is hereby granted, free of charge, to any person obtaining a copy of the Unicode data files and any associated documentation (the "Data Files") or Unicode software and any associated documentation (the "Software") to deal in the Data Files or Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Data Files or Software, and to permit persons to whom the Data Files or Software are furnished to do so, provided that (a) the above copyright notice(s) and this permission notice appear with all copies of the Data Files or Software, (b) both the above copyright notice(s) and this permission notice appear in associated documentation, and (c) there is clear notice in each modified Data File or in the Software as well as in the documentation associated with the Data File(s) or Software that the data or software has been modified.
THE DATA FILES AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THE DATA FILES OR SOFTWARE.
Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in these Data Files or Software without prior written authorization of the copyright holder.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and may be registered in some jurisdictions. All other trademarks and registered trademarks mentioned herein are the property of their respective owners.
End of utf8proc license
Copyright
This document has been placed in the public domain.