Author: Alexandre Ferrieux <[email protected]>
Author: Jan Nijtmans <[email protected]>
State: Final
Type: Project
Vote: Done
Created: 02-Feb-2009
Post-History:
Keywords: Tcl,encoding,convertto,strict,Unicode
Tcl-Version: 8.7
Tcl-Branch: tip-346
Obsoleted-By: 654
Vote-Summary: Accepted 5/0/0
Votes-For: FV, JN, KW, MC, SL
Votes-Against: none
Votes-Present: none
Abstract
This TIP proposes to raise an error when an encoding-based conversion loses information.
Background
Encoding-based conversions occur, for example, when writing a string to a
channel. In doing so, Unicode characters are converted to sequences of bytes
according to the channel's encoding. For some
combinations of Unicode character and target encoding, the mapping is lossy
(non-injective). For example, the "é" character and many of its
cousins are mapped to "?" in the 'ascii' target encoding. This loss of
information is sometimes not desired.
Proposed Change
This TIP proposes to make this loss conspicuous.
The idea is to introduce a -strict option to encoding convertto/encoding convertfrom, which raises an explicit error when non-mappable characters are met. Depending on this flag, every encoder/decoder can decide whether or not it tries to convert invalid (but common) byte sequences to valid characters.
This -strict option cannot be combined with the already existing -nocomplain option (see also "History" section below).
For channels, there's a new -strictencoding option, to be used in fconfigure,
which can be set to true (default: false). Setting it to true has the same
effect as the -strict option for encoding convertto/encoding convertfrom.
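The difference between the default and the -strict behaviour for a lossy target encoding can be sketched in a few lines of C. This is only an illustration, not Tcl's implementation: the function to_ascii and its signature are hypothetical, and 'ascii' is used because it is the example given in the Background section.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the Tcl API): encode n Unicode code
 * points from src as ASCII into dst, which must hold at least n + 1 bytes.
 *
 * strict == 0: unmappable code points are replaced by '?' (lossy).
 * strict != 0: conversion stops at the first unmappable code point.
 *
 * Returns -1 on success, or the index of the first unmappable code point
 * when strict is set. dst is always NUL-terminated.
 */
long to_ascii(const uint32_t *src, size_t n, char *dst, int strict)
{
    for (size_t i = 0; i < n; i++) {
        if (src[i] < 0x80) {
            dst[i] = (char)src[i];      /* maps one-to-one */
        } else if (strict) {
            dst[i] = '\0';              /* keep what was converted so far */
            return (long)i;             /* report where the loss would occur */
        } else {
            dst[i] = '?';               /* silent, lossy substitution */
        }
    }
    dst[n] = '\0';
    return -1;
}
```

With the code points of "café" as input, the non-strict call yields "caf?" and reports success, while the strict call stops and reports index 3, leaving it to the caller to turn that into an error.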
For the utf-8/utf-16/utf-32 encodings, the -strict option has an additional
effect: it raises an exception in three situations:
Byte sequence '\xC0\x80' is normally accepted by the utf-8 decoder, since it's a representation of '\x00'. When using -strict, this byte sequence will raise an error.
    % encoding convertfrom -strict utf-8 \xC0\x80
    unexpected byte sequence starting at index 0: '\xC0'
Invalid byte sequences. By default, the utf-8 decoder detects invalid byte sequences, but - if encountered - tries to handle them as if they were iso8859-1 or cp1252. When using -strict, this now will raise an error.
Surrogates. By default, the utf-8/utf-16/utf-32 decoders detect surrogates, but - if encountered - let them pass through unchanged. When using -strict, this now will raise an error. (This check was added after this TIP was accepted, see last sentence in "Implementation" section below)
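The three strict checks for utf-8 listed above can be sketched as a small stand-alone validator. This is a simplified illustration under stated assumptions, not the code used by Tcl's utf-8 decoder: the name strict_utf8_first_error is hypothetical, and the upper bound check against U+10FFFF is treated here as one more case of "invalid byte sequence".

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the Tcl API): scan n bytes of s and
 * return the index of the first byte sequence a strict utf-8 decoder
 * would reject, or -1 if the whole input is acceptable.
 */
long strict_utf8_first_error(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = s[i];
        size_t len;
        uint32_t cp, min;

        if (b < 0x80) {                 /* plain ASCII byte */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            len = 2; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            len = 3; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            len = 4; cp = b & 0x07; min = 0x10000;
        } else {
            return (long)i;             /* stray continuation or invalid lead byte */
        }

        if (n - i < len) {
            return (long)i;             /* truncated sequence */
        }
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) {
                return (long)i;         /* missing continuation byte */
            }
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }

        if (cp < min) {
            return (long)i;             /* overlong form, e.g. C0 80 for U+0000 */
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return (long)i;             /* surrogate code point */
        }
        if (cp > 0x10FFFF) {
            return (long)i;             /* beyond the Unicode range (assumption, see above) */
        }
        i += len;
    }
    return -1;
}
```

For the two bytes C0 80 this function returns 0, which corresponds to the "starting at index 0" reported in the example above. A non-strict decoder would instead accept that input, fall back to a single-byte interpretation for invalid sequences, and pass surrogates through.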
As said, other encoders handle -strict in their own way. Currently,
the 'core' utf-8/utf-16/utf-32 decoders and the table-based decoders
have -strict handling implemented.
Implementation
A new flag TCL_ENCODING_STRICT is introduced. This flag inherits all
behavior of TCL_ENCODING_STOPONERROR, but for the utf-8 decoder
it enables some additional checks, which result in an exception.
This flag can be used in Tcl_UtfToExternalDStringEx()/Tcl_ExternalToUtfDStringEx()
and all related API. Other encoders/decoders can use this flag
in their own way.
At this moment, the implementation is not yet complete. A known bug, 6978c01b65, prevents an exception from being thrown when writing to a channel. Also, the only additional check implemented so far is for the '\xC0\x80' byte sequence. After this TIP is accepted, more strict checks will be added to more encodings, since the meaning of "invalid byte sequence" differs for every encoder.
History
The original version of this TIP also contained changes to SetByteArrayFromAny, which are now available in TIP #568.
See the original TIP #346.
The TIP - as voted - contained the sentence: "This -strict option cannot be combined with the already existing -nocomplain and -failindex options.". After follow-up discussion, it turned out to be a good idea to allow -strict with -failindex (See: [a31caff057]), therefore this sentence was modified in the TIP.
The TIP - as voted - didn't handle Surrogates, but as Surrogates are
considered invalid in utf-8/utf-16/utf-32 (unless paired
properly in utf-16), the TIP text is adapted accordingly.
Some wording that was relevant only to the now-removed ByteArray portion of this TIP was later also removed.
Reference Example
See TIP #346.
Copyright
This document has been placed in the public domain.