Author: Alexandre Ferrieux <[email protected]>
Author: Jan Nijtmans <[email protected]>
State: Draft
Type: Project
Vote: In progress
Created: 02-Feb-2009
Post-History:
Keywords: Tcl,encoding,convertto,strict,Unicode
Tcl-Version: 8.7
Tcl-Branch: tip-346
This TIP proposes to raise an error when an encoding-based conversion loses information.
Encoding-based conversions occur e.g. when writing a string to a
channel. In doing so, Unicode characters are converted to sequences of bytes
according to the channel's encoding. Similarly, a conversion can occur
on request of the ByteArray internal representation of an object, the target
encoding being ISO8859-1. In both cases, for some
combinations of Unicode char and target encoding, the mapping is lossy
(non-injective). For example, the "é" character, and many of its
cousins, is mapped to a "?" in the 'ascii' target encoding.
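The lossiness described above can be observed from script level without any new option. The following is a minimal sketch; the helper name `isLossy` is hypothetical and not part of this TIP. It treats a conversion as lossy when the encoded bytes do not decode back to the original string, and also when the interpreter refuses the conversion outright.

```tcl
# Hypothetical helper (not part of this TIP): returns 1 when converting
# $str to encoding $enc loses information, 0 otherwise.
proc isLossy {enc str} {
    # Some interpreters raise an error instead of substituting a
    # fallback character; count that as "cannot be represented".
    if {[catch {encoding convertto $enc $str} bytes]} {
        return 1
    }
    # A lossless conversion must decode back to the original string.
    return [expr {[encoding convertfrom $enc $bytes] ne $str}]
}

puts [isLossy ascii "cafe"]
puts [isLossy ascii "caf\u00e9"]
puts [isLossy utf-8 "caf\u00e9"]
```

Only the second call reports a loss, because "é" has no representation in 'ascii'.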
In the first case, this loss of information leads to i18n bugs that go unnoticed.
This TIP proposes to make this loss conspicuous.
The idea is to introduce a -strict option to encoding convertto/encoding convertfrom that raises an explicit error when non-mappable characters are encountered. Depending on this flag, every encoder/decoder can decide whether or not it tries to convert invalid (but common) byte sequences to valid characters.
This -strict option cannot be combined with the already existing -nocomplain and -failvar options.
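As a sketch of how a script might use the proposed option, the wrapper below calls `encoding convertto -strict` and turns a failure into an error that names the target encoding. The proc name `strictEncode` is hypothetical. The `-strict` option itself is the one proposed here, so this only behaves as intended on an interpreter that implements this TIP. Other interpreters reject the call as a usage error, which the wrapper reports in the same way.

```tcl
# Hypothetical wrapper around the option proposed by this TIP.
proc strictEncode {enc str} {
    # ASSUMPTION: the interpreter implements -strict as proposed here.
    # Interpreters without it fail this call with a usage/option error.
    if {[catch {encoding convertto -strict $enc $str} result]} {
        return -code error "cannot encode as $enc: $result"
    }
    return $result
}

# "é" is not representable in 'ascii', so an error is expected.
if {[catch {strictEncode ascii "caf\u00e9"} msg]} {
    puts stderr $msg
}
```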
For channels, there's a new -strictencoding option, to be used in
which can be set to true (default: false). Setting it to true has the same
effect as the -strict option for encoding convertto/encoding convertfrom.
For the utf-8 encoding, the -strict option has an additional effect: it raises an exception in two situations:
Byte sequence '\xC0\x80' is normally accepted by the utf-8 decoder, since it's a representation of '\x00'. When using -strict, this byte sequence will raise an error.
    % encoding convertfrom -strict utf-8 \xC0\x80
    unexpected byte sequence starting at index 0: '\xC0'
Invalid byte sequences. By default, the utf-8 decoder detects invalid byte sequences but, when it encounters them, tries to handle them as if they were iso8859-1 or cp1252. When using -strict, this will now raise an error.
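On interpreters without -strict, the two situations above can be approximated from script level with a decode/re-encode round trip. This is only a sketch, not the check this TIP implements, and the helper name `looksLikeCanonicalUtf8` is hypothetical. It assumes that a lenient decoder reinterprets non-canonical input, so that re-encoding the result yields different bytes. It may not catch every sequence a strict decoder would reject.

```tcl
# Hypothetical helper (not part of this TIP): returns 1 when $bytes
# survives a utf-8 decode/re-encode round trip unchanged, else 0.
proc looksLikeCanonicalUtf8 {bytes} {
    # Interpreters that already decode strictly raise an error here.
    if {[catch {encoding convertfrom utf-8 $bytes} text]} {
        return 0
    }
    # A lenient decoder accepts the input; re-encoding then produces
    # different bytes whenever the input had to be reinterpreted.
    if {[catch {encoding convertto utf-8 $text} again]} {
        return 0
    }
    return [expr {$again eq $bytes}]
}

# A valid two-byte sequence, the '\xC0\x80' form, and a lone lead byte.
foreach sample [list "\xC3\xA9" "\xC0\x80" "\xE9"] {
    binary scan $sample H* hex
    puts "$hex -> [looksLikeCanonicalUtf8 $sample]"
}
```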
As noted, other encoders can handle -strict in their own way. Currently, the
utf-8 decoder is the only 'core' encoding affected by it (but that will probably change).
A new flag TCL_ENCODING_STRICT is introduced. This flag inherits the behavior of
TCL_ENCODING_STOPONERROR, but for the utf-8 decoder
it enables some additional checks, which result in an exception.
This flag can be used in
and all related API. Other encoders/decoders can use this flag
in their own way.
At the moment, the implementation is not yet complete. A known bug, 6978c01b65, prevents an exception from being thrown when writing to a channel. Also, the only additional check implemented so far is for the '\xC0\x80' byte sequence. After this TIP is accepted, more strict checks will be added to more encodings: the meaning of "invalid byte sequence" is different for every encoder.
The original version of this TIP also described a matching -strictencoding [fconfigure] option, which has now moved to TIP #633. It also contained changes to SetByteArrayFromAny, which are now available in TIP #568.
See the original TIP #346 here.
See TIP #346.
This document has been placed in the public domain.