Author:         Alexandre Ferrieux <[email protected]>
Author:         Jan Nijtmans <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Created:        02-Feb-2009
Post-History:   
Keywords:       Tcl,encoding,convertto,strict,Unicode
Tcl-Version:    8.7
Tcl-Branch:     tip-346
Obsoleted-By:	654
Vote-Summary:   Accepted 5/0/0
Votes-For:     FV, JN, KW, MC, SL
Votes-Against: none
Votes-Present: none

Abstract

This TIP proposes to raise an error when an encoding-based conversion loses information.

Background

Encoding-based conversions occur, for example, when writing a string to a channel. In doing so, Unicode characters are converted to sequences of bytes according to the channel's encoding. For some combinations of Unicode character and target encoding, the mapping is lossy (non-injective). For example, the "é" character and many of its cousins are mapped to "?" in the 'ascii' target encoding. This loss of information is sometimes not desired.
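The loss described above can be reproduced with a short script. This is a minimal sketch assuming the default (non-strict) behavior of Tcl 8.6; the variable names are illustrative only, and releases that obsolete this TIP may behave differently.

```tcl
# U+00E9 (e with acute accent) has no representation in the ascii encoding.
set text "caf\u00e9"

# Without any strictness option, the unmappable character is silently
# replaced instead of being reported.
set bytes [encoding convertto ascii $text]
puts $bytes

# The substitution is not reversible: decoding does not restore the input.
set roundtrip [encoding convertfrom ascii $bytes]
puts [expr {$roundtrip eq $text ? "lossless" : "information lost"}]
```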

Proposed Change

This TIP proposes to make this loss conspicuous.

The idea is to introduce a -strict option to encoding convertto/encoding convertfrom that raises an explicit error when non-mappable characters are met. Depending on this flag, every encoder/decoder can decide whether or not it tries to convert invalid (but common) byte sequences to valid characters.

This -strict option cannot be combined with the already existing -nocomplain option (see also "History" section below).
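As an illustration of how calling code might use the option, the following sketch wraps a strict conversion in try. It assumes an interpreter that implements -strict as proposed here; encodeStrict is a hypothetical helper, not part of Tcl, and the exact error text is deliberately not relied upon.

```tcl
# Hypothetical helper: returns {ok bytes} on success or {error message}
# when the conversion is rejected.
proc encodeStrict {encodingName text} {
    try {
        return [list ok [encoding convertto -strict $encodingName $text]]
    } on error {msg} {
        return [list error $msg]
    }
}

foreach sample [list "cafe" "caf\u00e9"] {
    lassign [encodeStrict ascii $sample] status detail
    puts "$sample -> $status: $detail"
}

# Per the text above, combining -strict with -nocomplain is rejected,
# so catch is expected to report a failure (non-zero) here.
puts [catch {encoding convertto -strict -nocomplain ascii "cafe"} msg]
```

On an interpreter that lacks the option, both calls end up in the error branch because the option itself is rejected, so the status alone does not tell the two cases apart.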

For channels, there's a new -strictencoding option, to be used in fconfigure, which can be set to true (default: false). Setting it to true has the same effect as the -strict option for encoding convertto/encoding convertfrom.
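A channel-level sketch, assuming an interpreter that provides the -strictencoding option described above. The scratch file is only there to have a writable channel. Whether the write is actually rejected depends on the state of the implementation, so both outcomes are handled.

```tcl
# Write to a scratch file; [file tempfile] stores the file name in "path".
set chan [file tempfile path]
try {
    # -strictencoding is the channel option proposed in this TIP;
    # interpreters without it fail here with an error about the option.
    fconfigure $chan -encoding ascii -strictencoding true
    puts $chan "caf\u00e9"
    flush $chan
    puts "write accepted"
} on error {msg} {
    puts "write rejected: $msg"
} finally {
    # Always release the channel and remove the scratch file.
    catch {close $chan}
    file delete $path
}
```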

For the utf-8/utf-16/utf-32 encodings, the -strict option has an additional effect: it raises an exception in the following 3 situations:

Byte sequence '\xC0\x80' is normally accepted by the utf-8 decoder, since it's a representation of '\x00'. When using -strict, this byte sequence will raise an error.
```
% encoding convertfrom -strict utf-8 \xC0\x80
unexpected byte sequence starting at index 0: '\xC0'
```
Invalid byte sequences. By default, the utf-8 decoder detects invalid byte sequences, but - if encountered - tries to handle them as if they were iso8859-1 or cp1252. When using -strict, this will now raise an error.
Surrogates. By default, the utf-8/utf-16/utf-32 decoders detect surrogates, but - if encountered - let them pass through unchanged. When using -strict, this will now raise an error. (This check was added after this TIP was accepted, see last sentence in "Implementation" section below)
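The three situations can be compared side by side with a small script. This is a sketch assuming an interpreter that implements -strict as described; compareDecoding is a hypothetical helper, and the sample bytes are chosen for illustration (an overlong NUL, a lone 0xE9 byte, and the UTF-8 form of the surrogate U+D800).

```tcl
# Decode the same bytes with and without -strict and report each outcome.
proc compareDecoding {label bytes} {
    set outcome [dict create]
    foreach {mode opts} {default {} strict -strict} {
        set cmd [list encoding convertfrom {*}$opts utf-8 $bytes]
        if {[catch $cmd result]} {
            dict set outcome $mode error
            puts "$label ($mode): error: $result"
        } else {
            dict set outcome $mode ok
            puts "$label ($mode): decoded [string length $result] character(s)"
        }
    }
    return $outcome
}

compareDecoding "overlong NUL"  "\xC0\x80"
compareDecoding "invalid byte"  "\xE9"
compareDecoding "surrogate"     "\xED\xA0\x80"
```

Reporting only the status and the decoded length keeps the script independent of the exact error wording, which differs between decoders.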

As said, other encoders handle -strict in their own way. Currently the 'core' utf-8/utf-16/utf-32 decoders and the table-based decoders have -strict handling implemented.

Implementation

A new flag TCL_ENCODING_STRICT is introduced. This flag inherits all behavior of TCL_ENCODING_STOPONERROR, but for the utf-8 decoder it enables some additional checks, which result in an exception. This flag can be used in Tcl_UtfToExternalDStringEx()/Tcl_ExternalToUtfDStringEx() and all related API. Other encoders/decoders can use this flag in their own way.

At this moment, the implementation is not yet complete. A known bug, 6978c01b65, prevents an exception from being thrown when writing to a channel. Also, the only additional check implemented so far is for the '\xC0\x80' byte sequence. After this TIP is accepted, more strict checks will be added to more encodings: the meaning of "invalid byte sequence" is different for every encoder.

History

The original version of this TIP also contained changes to SetByteArrayFromAny, which are now available in TIP #568.

See the original TIP #346 here.

The TIP - as voted - contained the sentence: "This -strict option cannot be combined with the already existing -nocomplain and -failindex options.". After follow-up discussion, it turned out to be a good idea to allow -strict with -failindex (see [a31caff057]), so this sentence was modified in the TIP.

The TIP - as voted - didn't handle surrogates, but as surrogates are considered invalid in utf-8/utf-16/utf-32 (unless paired properly in utf-16), the TIP text was adapted accordingly.

Some wording that was relevant only to the now-removed ByteArray portion of this TIP was later also removed.

Reference Example

See TIP #346.

Copyright

This document has been placed in the public domain.