TIP 601: Make "encoding convertto/convertfrom" throw exceptions

Login
Bounty program for improvements to Tcl and certain Tcl packages.
Author:         Harald Oehlmann <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Tcl-Version:    9.0
Tcl-Branch:     encodings-with-flags
Vote-Summary    Accepted 6/0/0
Votes-For:      FV, JD, JN, KW, MC, SL
Votes-Against:  none
Votes-Present:  none

Abstract

This TIP proposes to change the behavior of encoding convertto/convertfrom in Tcl 9.0, to start throwing an exception on any conversion error. An additional option -nocomplain is proposed to restore the Tcl 8.x behavior.

Rationale

The command encoding convertfrom/convertto currently does not raise any error on not convertible input. Instead the following actions are observed:

Example 1: not encodable character

The Polish character "L with bar" is not contained in ISO-latin 1:

% set s \u0141
Ł
% encoding convertto iso8859-1 $s
?

In the ISO-Latin 1 conversion, it is replaced by a question mark by the encoding convertto command.

Example 2: Incomplete sequence returns remaining value verbatim

The following utf-8 sequence has an incomplete final sequence. The second byte of the two byte sequence of the last character is missing. The incomplete sequence is interpreted as ISO8859-1 and added to the string.

 % set s [encoding convertfrom utf-8 [string range [encoding convertto utf-8 ÄÖ] 0 end-1]]
 ÄÃ
 % scan $s %c%c
 196 195

The first character value 196 is the correct "Ä" character. The second character is the verbatim byte of the incomplete utf-8 sequence:

 % scan [encoding convertto utf-8 Ö] %c%c
 195 150

Use case 1: decode continuous multi-byte data

My personal use-case is a stream of UTF-8 data which is received by a USB character driver and the binary data is cut in 64 byte chunks. The stream is continuous and I want to decode the received data. If a UTF-8 byte is received partly, a false byte is created and the next chunk decoding does not work, as it starts with a part of a UTF-8 multibyte sequence.

It would be great to know, where the error is to stop the sequence.

Here is a code snipped with the current implementation:

% catch {encoding convertfrom utf-8 [string range [encoding convertto utf-8 ÄÖ] 0 end-1]} e d
1
% set d
    -code 1\
    -level 0
    -errorstack {INNER {invokeStk1 ::tcl::encoding::convertfrom utf-8 Ã\x84Ã }}
    -errorcode {TCL ENCODING ILLEGALSEQUENCE 2}
    -errorinfo {unexpected byte sequence starting at index 2: '\xC3'
        while executing
        "encoding convertfrom utf-8 [string range [encoding convertto utf-8 ÄÖ] 0 end-1]"} -errorline 1

Note: "\0x84" was replaced for the control character "IND" for visibility in the stack trace

Note: "C3" is the hexadecimal representation for decimal 195. Thus, "\xC3" is the first byte of the utf-8 representation of "Ö".

With this info, the -errorcode may be catched by a try clause and the error byte location (2) may be catched (see discussion section for an example). The data before may by passed again to encoding convertfrom which is the correct received data.

Note: it would be efficient, if the already converted string may be returned also. Then, the data must not be passed again to encoding convertfrom. The current implementation does not fullfill this optimisation.

Use case 2:

This use case is given in the following TCL ticket: Ticket 535705 :

Wrong characters are included in a data base by character replacement when a character not in the current system encoding. This causes issues in a multi-platform applications, as the error is not detected.

Specification

New Option

The encoding ensemble will be extended by a new option -nocomplain:

encoding convertfrom ?-nocomplain? ?encoding? data
encoding convertto ?-nocomplain? ?encoding? data

In Tcl 8.x, -nocomplain has no effect since those encoding subcommands currently never throw an exception: Any invalid byte/character is replaced (by ? or U+FFFD). In Tcl 9.0, those subcommands start throwing and exception on any conversion data error, -nocomplain restores the Tcl 8.x behavior.

Definition of "error position"

The position of the error in the source string is indicated in the error reporting. In case of multi-byte source data, this position is always one byte after the last correct multi-byte sequence.

###Error Message

The error message is: "unexpected byte sequence starting at index error position: 'byte value'", for encoding convertfrom or "unexpected character at index error position: 'character value'", for encoding convertto where error position is a number containing the source string error position as defined above.

byte value/character value is the hexadecimal representation of the byte in the source string where error position points to.

Error Code

The error code is composed of the following 4 list elements:

  1. Fix word: TCL
  2. Fix word: ENCODING
  3. Fix word: ILLEGALSEQUENCE
  4. Value error position: The index in the source string (usually a byte array, in case of encoding convertfrom) of the error position.

New C API

Introduce 2 new functions:

These functions behave the same as Tcl_ExternalToUtfDString/Tcl_UtfToExternalDString, only they have an additional flags parameter accepting the following additional values (can be used in combination):

The TCL_ENCODING_MODIFIED flag can be used in extensions for generating "modified" encodings, such as java (which uses "modified" cesu-8 internally). This flag is not exposed at script level, unlike -nocomplain.

The (already existing) TCL_ENCODING_STOPONERROR flag is only provided for legacy reasons. This flag will be meaningless starting with Tcl 9.0, therefore will be deprecated in Tcl 9.0 and eventually removed in a future Tcl version (but not yet in 9.0). In Tcl 9.0, TCL_ENCODING_STOPONERROR will be defined as value 0.

The return value of these two functions is the error-position in case of an error reporting, or (size_t)-1 if everything is OK.

Discussion

Ticket 535705

This TIP started in the TCL ticket 535705. Please refer to this ticket for information about the initial discussion.

Error reporting design

The list of categories for the error code return is given in the tclvars manual page. The TCL category matches best.

This design allows to catch this error and get the error position by the following try pattern:

try {
    set res [encoding convertto iso8859-1 $input]
} trap {TCL ENCODING ILLEGALSEQUENCE} {errorMessage errorDict} {
    set errorIndex [lindex [dict get $errorDict -errorcode] 3]
    ...
}

Rejected alternatives

Report the error character

The original implementation reported the failing character/byte in the error message. This may be a control character corrupting a terminal view. IMHO error messages should be in the printable ASCII character set. Therefore this was removed, the character/byte is now only reported in hexadecimal notation.

-nocomplain as boolean option

An alternative would be to use -nocomplain 1 in stead of -nocomplain and -nocomplain 0 in stead of -nocomplain. This only makes the command longer, without much benefit.

-stoponerror

This option is the reverse of -nocomplain, but much less descriptive. In combination with -nocomplain, it causes more confusion than that it helps.

EILSEQ POSIX error code

Recent changes to TCL use the POSIX error EILSEQ: "invalid byte sequence", which looks like the most appropriate error message. Nevertheless, the POSIX message does not allow to return the error position.

Alternate solutions

See [607], which is not actually an alternative, but can be added later on top of this TIP.

Implementation

Implementation is in Tcl branch encodings-with-flags

Compatibility

The implementation is fully backward compatible for 8.7. There is a compatibility break for TCL 9.0.

Credits

Thanks to Jan Nijtmans for idea and implementation.

Copyright

This document has been placed in the public domain.