Author: Jan Nijtmans <[email protected]>
State: Final
Type: Project
Vote: Done
Tcl-Version: 8.7
Tcl-Branch: tip-597
Vote-Summary: Accepted 3/0/0
Votes-For: DGP, JN, MC
Votes-Against: none
Votes-Present: none
Abstract
Currently, there is a discussion going on, about making Tcl more conform to the Unicode specification. Internally, Tcl allows to use Unicode codepoints which are not allowed to be exported to other applications, which makes Tcl non-conformant. TIP #573 was an attempt to solve this by forbidding the use of surrogates.
This TIP provides another approach. We can allow all characters internally, but modify the "utf-8"/"utf-16" encoders such that any surrogate or noncharacter (as defined by the Unicode standard) is replaced by the replacement character U+FFFD.
Further on, there is a new "cesu-8" encoding.
Finally "string bytelength" will be deprecated in Tcl 8.7 and fully removed in Tcl 9.0. The new encodings can be used to create a good replacement for this command.
Specification
Introduce a new command
string is unicode
This command will return 1 if the string does not contain any character from the following set:
- 2048 surrogates (U+D800 - U+DFFF)
- 66 noncharacters (U+??FFFE - U+??FFFF and U+FDD0 - U+FDEF)
Contrary to other string is
commands, which adapt to the
evolving Unicode standard, string is unicode
will not
change any more in future standards. This command cannot be used to
check if a Unicode character is defined for a code point in the
current Unicode standard. If the codepoint is available for a
possible future assignment, string is unicode
will return 1.
The string is unicode
command can be used to check if the
"utf-8"/"utf-16" encodings would deliver valid output, e.g.
if {[string is unicode $text]} { set f [open somefile.txt] fconfigure $f -encoding binary puts $f [encoding convertto utf-16 $text] } else { puts stderr "Cannot write to file: non-conformant utf-16" }The
encoding converto
command currently has no other way
to indicate encoding errors.
The problem with surrogates is that in the UTF-16 encoding there is no way to distinguish a surrogate-pair from a character > U+FFFF. Therefore the surrogate code-points (U+D800 - U+DFFF) are not allowed in UTF-8. When other applications receive such non-conformant UTF-8, behavior is undefined.
For a similar reason, noncharacters are problematic in UTF-16. Since U+FEFF is the BOM (Byte order Mark), allowing U+FFFE as possible value means we can no longer distinguish little-endian UTF-16 files from big-endian. And the U+FFFF pattern (short -1) is used very often in binary files, so allowing U+FFFF in UTF-16 makes it more difficult to distinguish binary files from UTF-16 text (the NULL-byte cannot be used for that, because it is allowed - and frequent - in UTF-16). That's why the last two characters in each plane (U+??FFFE - U+??FFFF) are defined as unicode "noncharacters".
Introduce a new API:
int Tcl_UniCharIsUnicode(int character)
This function returns 1 if character
is between 0x0000 and
0x10FFFE, and it is not a surrogate and not a noncharacter.
Introduce new "utf-8"/"utf-16" encodings. When converting from
internal utf-8 to external utf-8/utf-16, any character for which
string is unicode
returns 0 will be produce the replacement
character U+FFFD (bytes \xEF \xBF \xBD). When converting from
external utf-8/utf16 to internal utf-8, nothing changes: The new
utf-8/utf-16 decoders are forgiving for surrogates and noncharacters,
they can continue to be processed by Tcl as-is.
Introduce a new "cesu-8" encoding. It's the same as "utf-8", only characters > U+FFFF will be output as a 6-byte sequence in stead of a 4-byte sequence, and the surrogate/noncharacter codepoints are considered conformant. See: CESU-8
The new encoders all implement the flag TCL_ENCODING_STOPONERROR (which is not accessible at script level). When this flag is set, the encoder/decoder will stop processing when it encounters a surrogate or noncharacter or some other problem (e.g. overlong byte sequences or missing continuation bytes)
Finally, deprecate the "string bytelength" command. It can be replaced by "string length [encoding convertto utf-8]. In Tcl 9, the "string bytelength" command will be removed.
Further enhancements
At this moment, Tcl doesn't have access to the TCL_ENCODING_STOPONERROR flag at script level. Work is ongoing (in the "encodings-with-flags" branch) to change that. This can be used to let the "utf-8" encoder automatically stop processing when it encounters a surrogate or noncharacter, in stead of producing \xEF \xBF \xBD (as proposed in this TIP). Since this change brings more complications, it is left out of scope for this TIP. But it would (IMHO) be a very useful addition.
Rejected alternatives
The "wtf-8"/"wtf-16" and "tcl-8" encodings, proposed earlier, are considered inappropriate. Exposing them would expose too much internal Tcl implementation at script level, which then cannot be changed any more in future Tcl versions. They potentially harm more than they benefit.
Implementation
Implementation is in Tcl branch tip-597
Compatibility
Since Tcl 8.6's "utf-8"/"utf-16" encoders can produce non-conformant utf-8/utf-16, and the new "utf-8"/"utf-16" encoders cannot any more, this introduces a potential incompatibility for applications which - illegally - export non-conformant utf-8/utf-16. Applications which - willingly - want to violate the Unicode standard can now use the "cesu-8" or "ucs-2" encoders in stead. The "utf-8"/"utf-16" decoders are unchanged, so Tcl can continue to handle non-conformant utf-8/utf-16 from other applications.
Copyright
This document has been placed in the public domain.