TIP 346: Error on Failed String Encodings

Author:         Alexandre Ferrieux <[email protected]>
Author:         Jan Nijtmans <[email protected]>
State:          Draft
Type:           Project
Vote:           In progress
Created:        02-Feb-2009
Post-History:   
Keywords:       Tcl,encoding,convertto,strict,Unicode
Tcl-Version:    8.7
Tcl-Branch:     tip-346

Abstract

This TIP proposes to raise an error when an encoding-based conversion loses information.

Background

Encoding-based conversions occur, for example, when writing a string to a channel. In doing so, Unicode characters are converted to sequences of bytes according to the channel's encoding. Similarly, a conversion can occur when the ByteArray internal representation of an object is requested, the target encoding being ISO8859-1. In both cases, for some combinations of Unicode character and target encoding, the mapping is lossy (non-injective). For example, the "é" character, and many of its cousins, are mapped to "?" in the 'ascii' target encoding.

This loss of information, in the first case, introduces unnoticed i18n mishandlings.

Proposed Change

This TIP proposes to make this loss conspicuous.

The idea is to introduce a -strict option to encoding convertto/encoding convertfrom, that would raise an explicit error when non-mappable characters are met. Every encoder/decoder can decide, depending on this flag, if it tries to convert invalid (but common) byte sequences to valid characters or not.
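The difference between lenient and strict conversion can be sketched in C. This is an illustration only, not Tcl's encoder: the function to_ascii, its strict parameter and its return convention are hypothetical names chosen for this sketch, and the input is assumed to be an array of already-decoded Unicode code points.

```c
#include <stddef.h>

/* Hypothetical sketch, not Tcl source code.
 * Converts n Unicode code points to ASCII bytes in out, which must have
 * room for n + 1 bytes (the result is NUL-terminated).
 * Lenient mode (strict == 0): unmappable characters become '?'.
 * Strict mode (strict != 0): conversion stops at the first unmappable
 * character and its index is returned, so the caller can raise an error.
 * Returns -1 when the whole input was processed. */
long to_ascii(const unsigned int *cps, size_t n, char *out, int strict)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (cps[i] < 0x80) {
            out[i] = (char)cps[i];
        } else if (strict) {
            out[i] = '\0';
            return (long)i;          /* report the offending position */
        } else {
            out[i] = '?';            /* silent loss of information */
        }
    }
    out[n] = '\0';
    return -1;
}
```

With the code points for "café" as input, the lenient call fills out with "caf?" and returns -1, while the strict call stops at index 3 and leaves "caf" in out.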

This -strict option cannot be combined with the already existing -nocomplain and -failvar options.

For channels, there's a new -strictencoding option, to be used in fconfigure, which can be set to true (default: false). Setting it to true has the same effect as the -strict option for encoding convertto/encoding convertfrom.

For the utf-8 encoding, the -strict option has an additional effect: it generates an exception in 2 different situations:

As said, other encoders can handle -strict in their own way; currently the utf-8 decoder is the only 'core' encoding affected by it (but that will probably change).

Implementation

A new flag TCL_ENCODING_STRICT is introduced. This flag inherits all behavior of TCL_ENCODING_STOPONERROR, but for the utf-8 decoder it enables some additional checks, which result in an exception. This flag can be used in Tcl_UtfToExternalDStringEx()/Tcl_ExternalToUtfDStringEx() and all related API. Other encoders/decoders can use this flag in their own way.

At this moment, the implementation is not yet complete. A known bug, 6978c01b65, prevents an exception from being thrown when writing to a channel. Also, the only additional check implemented so far is for the '\xC0\x80' byte sequence. After this TIP is accepted, stricter checks will be added to more encodings: the meaning of "invalid byte sequence" is different for every encoder.
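The '\xC0\x80' check could look roughly like the following C sketch. It is not the code on the tip-346 branch, and the function name find_overlong_nul is hypothetical. The byte pair C0 80 is an overlong two-byte encoding of U+0000: standard UTF-8 forbids it, but Tcl's internal string representation uses it for embedded NUL characters.

```c
#include <stddef.h>

/* Hypothetical sketch, not Tcl source code.
 * Returns the index of the first C0 80 byte pair in buf, or -1 if there
 * is none. 0xC0 can never be a continuation byte (those are 0x80..0xBF),
 * so a plain scan needs no decoder state. */
long find_overlong_nul(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i + 1 < len; i++) {
        if (buf[i] == 0xC0 && buf[i + 1] == 0x80) {
            return (long)i;
        }
    }
    return -1;
}
```

Under these assumptions, a strict decoder would report an error at the returned index, whereas a lenient decoder would typically accept the pair and decode it as U+0000.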

History

The original version of this TIP also described a matching -strictencoding [fconfigure] option, which has now moved to TIP #633. It also contained changes to SetByteArrayFromAny, which are now available in TIP #568.

See the original TIP #346 here.

Reference Example

See TIP #346.

Copyright

This document has been placed in the public domain.