TIP 597: "string is unicode" and better utf-8/utf-16/cesu-8 encodings

Author:         Jan Nijtmans <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Tcl-Version:    8.7
Tcl-Branch:     tip-597
Vote-Summary:	Accepted 3/0/0
Votes-For:	DGP, JN, MC
Votes-Against:	none
Votes-Present:	none

Abstract

There is an ongoing discussion about making Tcl conform more closely to the Unicode specification. Internally, Tcl allows the use of Unicode codepoints which are not allowed to be exported to other applications, which makes Tcl non-conformant. TIP #573 was an attempt to solve this by forbidding the use of surrogates.

This TIP provides another approach. We can allow all characters internally, but modify the "utf-8"/"utf-16" encoders such that any surrogate or noncharacter (as defined by the Unicode standard) is replaced by the replacement character U+FFFD.

In addition, this TIP introduces a new "cesu-8" encoding.

Finally "string bytelength" will be deprecated in Tcl 8.7 and fully removed in Tcl 9.0. The new encodings can be used to create a good replacement for this command.

Specification

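Introduce a new command, string is unicode.

This command will return 1 if the string does not contain any surrogate (U+D800 - U+DFFF) or any noncharacter as defined by the Unicode standard, such as the last two code points of each plane (U+??FFFE - U+??FFFF).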

Contrary to other string is commands, which adapt to the evolving Unicode standard, string is unicode will not change any more in future standards. This command cannot be used to check if a Unicode character is defined for a code point in the current Unicode standard. If the codepoint is available for a possible future assignment, string is unicode will return 1.
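The check described above can be sketched outside Tcl. The following Python function is only an illustration, not the Tcl implementation: the name `is_conformant` is hypothetical, and the U+FDD0 - U+FDEF range is taken from the Unicode standard's definition of noncharacters rather than from this text.

```python
def is_conformant(text: str) -> bool:
    """Return True if no code point is a surrogate or a noncharacter.

    Hypothetical helper that mirrors the rule described for
    "string is unicode"; it does not consult any Unicode database,
    so unassigned code points are accepted.
    """
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:        # surrogate code points
            return False
        if 0xFDD0 <= cp <= 0xFDEF:        # noncharacter block (Unicode standard)
            return False
        if (cp & 0xFFFE) == 0xFFFE:       # last two code points of each plane
            return False
    return True
```

Because the rule depends only on fixed ranges, the result for a given string stays the same across Unicode versions.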

The string is unicode command can be used to check if the "utf-8"/"utf-16" encodings would deliver valid output, e.g.

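if {[string is unicode $text]} {
    # Open for writing; the default access mode is read-only
    set f [open somefile.txt w]
    fconfigure $f -encoding binary
    # -nonewline avoids appending a stray single newline byte to the utf-16 data
    puts -nonewline $f [encoding convertto utf-16 $text]
    close $f
} else {
    puts stderr "Cannot write to file: non-conformant utf-16"
}

The encoding convertto command currently has no other way to indicate encoding errors.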

The problem with surrogates is that in the UTF-16 encoding there is no way to distinguish a surrogate-pair from a character > U+FFFF. Therefore the surrogate code-points (U+D800 - U+DFFF) are not allowed in UTF-8. When other applications receive such non-conformant UTF-8, behavior is undefined.
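The ambiguity can be seen in a few lines of Python, used here purely as an illustration. The sketch relies on Python's standard `surrogatepass` error handler to force lone surrogates through the encoder; the variable names are arbitrary.

```python
# U+1F600 lies above U+FFFF, so UTF-16 stores it as a surrogate pair.
astral = "\U0001F600".encode("utf-16-be")

# Two separate surrogate code points, forced through the encoder.
lone_pair = "\ud83d\ude00".encode("utf-16-be", "surrogatepass")

# Both produce the same byte stream, so a reader cannot tell them apart.
same_bytes = astral == lone_pair

# Python's strict UTF-8 encoder refuses a surrogate code point.
try:
    "\ud83d".encode("utf-8")
    rejected = False
except UnicodeEncodeError:
    rejected = True
```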

For a similar reason, noncharacters are problematic in UTF-16. Since U+FEFF is the BOM (Byte Order Mark), allowing U+FFFE as a possible value means we can no longer distinguish little-endian UTF-16 files from big-endian ones. And the U+FFFF pattern (short -1) is used very often in binary files, so allowing U+FFFF in UTF-16 makes it more difficult to distinguish binary files from UTF-16 text (the NUL byte cannot be used for that, because it is allowed, and frequent, in UTF-16). That's why the last two characters in each plane (U+??FFFE - U+??FFFF) are defined as Unicode "noncharacters".
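Both collisions can be checked directly. This Python snippet is a small illustration using only the standard library; the variable names are arbitrary.

```python
import struct

# The BOM U+FEFF as a little-endian writer emits it.
bom_le = "\ufeff".encode("utf-16-le")

# The noncharacter U+FFFE as a big-endian writer would emit it.
nonchar_be = "\ufffe".encode("utf-16-be")

# Identical bytes: the byte order can no longer be inferred.
ambiguous = bom_le == nonchar_be

# U+FFFF is the same bit pattern as a 16-bit integer holding -1.
all_ones = "\uffff".encode("utf-16-le")
looks_like_minus_one = all_ones == struct.pack("<h", -1)
```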

Introduce a new API:

This function returns 1 if character is between 0x0000 and 0x10FFFE, and it is not a surrogate and not a noncharacter.

Introduce new "utf-8"/"utf-16" encodings. When converting from internal utf-8 to external utf-8/utf-16, any character for which string is unicode returns 0 will produce the replacement character U+FFFD (bytes \xEF \xBF \xBD in utf-8). When converting from external utf-8/utf-16 to internal utf-8, nothing changes: the new utf-8/utf-16 decoders are forgiving of surrogates and noncharacters, which can continue to be processed by Tcl as-is.
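The encoder side of this rule might be sketched as follows. This is a Python approximation under stated assumptions, not the Tcl encoder: `encode_utf8_conformant` and `_is_bad` are hypothetical names, and the U+FDD0 - U+FDEF range comes from the Unicode standard.

```python
REPLACEMENT = "\ufffd"

def _is_bad(cp: int) -> bool:
    """True for surrogates and noncharacters (hypothetical helper)."""
    return (0xD800 <= cp <= 0xDFFF
            or 0xFDD0 <= cp <= 0xFDEF
            or (cp & 0xFFFE) == 0xFFFE)

def encode_utf8_conformant(text: str) -> bytes:
    """Encode to UTF-8, substituting U+FFFD for each offending code point."""
    cleaned = "".join(REPLACEMENT if _is_bad(ord(ch)) else ch for ch in text)
    return cleaned.encode("utf-8")
```

The substitution is lossy: once the output is written, the original surrogate or noncharacter cannot be recovered from it.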

Introduce a new "cesu-8" encoding. It is the same as "utf-8", except that characters > U+FFFF are output as a 6-byte sequence (a surrogate pair, 3 bytes per surrogate) instead of a 4-byte sequence, and the surrogate/noncharacter codepoints are considered conformant. See: CESU-8
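To make the 6-byte form concrete, here is a minimal Python sketch of a CESU-8 encoder. It is an illustration only, with the hypothetical name `encode_cesu8`; it uses Python's `surrogatepass` error handler to emit each surrogate as a 3-byte sequence.

```python
def encode_cesu8(text: str) -> bytes:
    """Encode text as CESU-8 (hypothetical helper, not the Tcl encoder)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair first.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            units = chr(high) + chr(low)
        else:
            units = ch
        # Each unit, including a surrogate, becomes at most 3 bytes.
        out += units.encode("utf-8", "surrogatepass")
    return bytes(out)
```

For U+1F600 this yields ED A0 BD ED B8 80, where plain UTF-8 yields F0 9F 98 80.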

The new encoders all implement the flag TCL_ENCODING_STOPONERROR (which is not accessible at script level). When this flag is set, the encoder/decoder will stop processing when it encounters a surrogate or noncharacter or some other problem (e.g. overlong byte sequences or missing continuation bytes).
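The stop-on-error behaviour for the encoder can be pictured with a short Python sketch. It only models the idea of halting at the first offending character; the function name and the returned tuple are assumptions made for illustration, and the real C-level interface and its return codes are not modelled.

```python
def encode_utf8_stop_on_error(text: str):
    """Encode until the first surrogate or noncharacter.

    Returns (encoded_bytes, chars_consumed). Hypothetical helper.
    """
    out = bytearray()
    for index, ch in enumerate(text):
        cp = ord(ch)
        bad = (0xD800 <= cp <= 0xDFFF
               or 0xFDD0 <= cp <= 0xFDEF
               or (cp & 0xFFFE) == 0xFFFE)
        if bad:
            # Stop here; the caller can see how far encoding got.
            return bytes(out), index
        out += ch.encode("utf-8")
    return bytes(out), len(text)
```

A caller can compare the consumed count with the input length to detect that encoding stopped early.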

Finally, deprecate the "string bytelength" command. It can be replaced by "string length [encoding convertto utf-8 $text]". In Tcl 9, the "string bytelength" command will be removed.
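The replacement works because the length of the encoded byte string is the byte count. The same idea in Python, as an illustration only (the name `utf8_byte_length` is hypothetical):

```python
def utf8_byte_length(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

# Character count and byte count differ once non-ASCII text is involved.
sample = "h\u00e9llo"          # 5 characters, one of them 2 bytes in UTF-8
char_count = len(sample)
byte_count = utf8_byte_length(sample)
```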

Further enhancements

At this moment, Tcl doesn't have access to the TCL_ENCODING_STOPONERROR flag at script level. Work is ongoing (in the "encodings-with-flags" branch) to change that. This can be used to let the "utf-8" encoder automatically stop processing when it encounters a surrogate or noncharacter, instead of producing \xEF \xBF \xBD (as proposed in this TIP). Since this change brings more complications, it is left out of scope for this TIP. But it would (IMHO) be a very useful addition.

Rejected alternatives

The "wtf-8"/"wtf-16" and "tcl-8" encodings, proposed earlier, are considered inappropriate. Exposing them would expose too much internal Tcl implementation at script level, which then cannot be changed any more in future Tcl versions. They potentially harm more than they benefit.

Implementation

Implementation is in Tcl branch tip-597

Compatibility

Since Tcl 8.6's "utf-8"/"utf-16" encoders can produce non-conformant utf-8/utf-16, and the new "utf-8"/"utf-16" encoders no longer can, this introduces a potential incompatibility for applications which - illegally - export non-conformant utf-8/utf-16. Applications which - willingly - want to violate the Unicode standard can now use the "cesu-8" or "ucs-2" encoders instead. The "utf-8"/"utf-16" decoders are unchanged, so Tcl can continue to handle non-conformant utf-8/utf-16 from other applications.

Copyright

This document has been placed in the public domain.