Author: Alexandre Ferrieux <[email protected]>
Author: Jan Nijtmans <[email protected]>
State: Final
Type: Project
Vote: Done
Created: 02-Feb-2009
Post-History:
Keywords: Tcl,encoding,convertto,strict,Unicode
Tcl-Version: 8.7
Tcl-Branch: tip-346
Vote-Summary: Accepted 5/0/0
Votes-For: FV, JN, KW, MC, SL
Votes-Against: none
Votes-Present: none
This TIP proposes to raise an error when an encoding-based conversion loses information.
Encoding-based conversions occur, for example, when writing a string to a
channel. In doing so, Unicode characters are converted to sequences of bytes
according to the channel's encoding. For some
combinations of Unicode character and target encoding, the mapping is lossy
(non-injective). For example, the "é" character and many of its
cousins are mapped to a "?" in the 'ascii' target encoding. This loss of
information is sometimes not desired.
This TIP proposes to make this loss conspicuous.
The idea is to introduce a -strict option to encoding convertto/encoding convertfrom that raises an explicit error when non-mappable characters are encountered. Depending on this flag, every encoder/decoder can decide whether or not it tries to convert invalid (but common) byte sequences to valid characters.
This -strict option cannot be combined with the already existing -nocomplain option (see also "History" section below).
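The following is a minimal sketch of how a script might use the proposed option. It assumes a Tcl build that implements -strict as described in this TIP. The helper name strictEncode is hypothetical and not part of the proposal. On builds without the option, encoding convertto itself fails with a usage error, which the helper reports in the same way.

```tcl
# Hypothetical helper (not part of the TIP): encode a string and report
# failure instead of silently substituting "?" for unmappable characters.
# Assumes [encoding convertto -strict] exists as proposed in this TIP.
proc strictEncode {encodingName text} {
    if {[catch {encoding convertto -strict $encodingName $text} result]} {
        # $result holds the error message of the failed conversion
        return [list error $result]
    }
    return [list ok $result]
}

# "é" has no representation in ascii, so a strict conversion should fail.
lassign [strictEncode ascii "caf\u00e9"] status detail
puts "$status: $detail"
```

Wrapping the call in catch keeps the decision about what to do with unmappable text (abort, fall back to another encoding, log) in the caller's hands.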
For channels, there is a new -strictencoding channel option,
which can be set to true (default: false). Setting it to true has the same
effect as the -strict option for encoding convertto/encoding convertfrom.
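As an illustration only, the channel option might be exercised as sketched below. The sketch assumes a build that provides -strictencoding as proposed here and uses fconfigure to set it. On builds without the option, the configuration step fails and is reported. Note that the "Implementation" text of this TIP mentions a known bug that can prevent the exception when writing to a channel, so the write may not be rejected.

```tcl
# Assumes the -strictencoding channel option proposed in this TIP.
# [file tempfile] creates a scratch file and stores its name in $path.
set chan [file tempfile path]

if {[catch {fconfigure $chan -encoding ascii -strictencoding true} msg]} {
    puts "this build has no -strictencoding option: $msg"
} elseif {[catch {puts $chan "caf\u00e9"; flush $chan} msg]} {
    # With strict encoding enabled, "é" cannot be written as ascii.
    puts "write rejected: $msg"
} else {
    puts "write accepted"
}

# Closing may re-raise a pending conversion error, so guard it as well.
catch {close $chan}
file delete $path
```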
For some encodings, including utf-8 and utf-32, the -strict option has an additional effect: it raises an exception in 3 different situations.

The '\xC0\x80' byte sequence. This sequence is normally accepted by the utf-8 decoder, since it is a representation of '\x00'. When using -strict, it raises an error:

    % encoding convertfrom -strict utf-8 \xC0\x80
    unexpected byte sequence starting at index 0: '\xC0'

Invalid byte sequences. By default, the utf-8 decoder detects invalid byte sequences but, if it encounters them, tries to handle them as if they were iso8859-1 or cp1252. When using -strict, this now raises an error.

Surrogates. By default, decoders such as utf-32 detect surrogates but, if they encounter them, let them pass through unchanged. When using -strict, this now raises an error. (This check was added after this TIP was accepted, see last sentence in "Implementation" section below)
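The three situations above can be probed with a small script such as the following sketch. It assumes encoding convertfrom -strict as proposed in this TIP. The byte strings are illustrative inputs chosen for this example, and the exact error messages may differ between builds. On builds without -strict, every case is reported as rejected because the command itself fails with a usage error.

```tcl
# Each entry: label, encoding name, raw input bytes.
# [list] is used so that the \x escapes are substituted.
set cases [list \
    "overlong NUL"   utf-8    "\xC0\x80" \
    "invalid byte"   utf-8    "\xFF" \
    "lone surrogate" utf-32le "\x00\xD8\x00\x00"]

set outcome [dict create]
foreach {label enc bytes} $cases {
    # Assumes the -strict option of this TIP; a failed decode raises an error.
    if {[catch {encoding convertfrom -strict $enc $bytes} result]} {
        dict set outcome $label rejected
        puts "$label: rejected ($result)"
    } else {
        dict set outcome $label accepted
        puts "$label: accepted"
    }
}
```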
As noted above, other encoders handle -strict in their own way. Currently,
decoders including utf-8 and utf-32, as well as the table-based decoders,
have -strict handling implemented.
A new flag TCL_ENCODING_STRICT is introduced. This flag inherits the
behavior of TCL_ENCODING_STOPONERROR, but for the utf-8 decoder
it enables some additional checks, which result in an exception.
This flag can be passed to the C-level encoding conversion functions
and all related API. Other encoders/decoders can use this flag
in their own way.
At this moment, the implementation is not yet complete. There is a known bug, 6978c01b65, which prevents an exception from being thrown when writing to a channel. Also, the only additional check implemented now is for the '\xC0\x80' byte sequence. After this TIP is accepted, more strict checks will be added to more encodings: the meaning of "invalid byte sequence" is different for every encoder.
The original version of this TIP also contained changes to SetByteArrayFromAny, which are now available in TIP #568.
See the original TIP #346 here.
The TIP - as voted - contained the sentence: "This -strict option cannot be combined with the already existing -nocomplain and -failindex options.". After follow-up discussion, it turned out to be a good idea to allow -strict with -failindex (see: [a31caff057]), so this sentence was modified in the TIP.
The TIP - as voted - didn't handle surrogates, but as surrogates are
considered invalid in encodings such as utf-32 (unless paired, as in
utf-16), the TIP text was adapted accordingly.
Some wording that was relevant only to the now-removed ByteArray portion of this TIP was later also removed.
See TIP #346.
This document has been placed in the public domain.