TIP 601: "encoding convertto/convertfrom" option to stop on error

Login
Bounty program for improvements to Tcl and certain Tcl packages.
Author:         Harald Oehlmann <[email protected]>
State:          Draft
Type:           Project
Vote:           Pending
Tcl-Version:    8.7
Tcl-Branch:     encodings-with-flags

Abstract

An additional option is proposed for encoding convertto/convertfrom to stop on the first encoding issue. In addition, this change is proposed as default vaue for TCL9.x series.

Rationale

The command encoding convertfrom/convertto currently does not raise any error on not convertable input. Instead the following actions are observed:

Example 1: not encodable character

The Polish character "L with bar" is not contained in ISO-latin 1:

% set s \u0141
Ł
% encoding convertto iso8859-1 $s
?

In the ISO-Latin 1 conversion, it is replaced by a question mark by the encoding convertto command.

Example 2: Incomplete sequence returns remaining value verbatim

The following utf-8 sequence has an incomplete final sequence. The second byte of the two byte sequence of the last character is missing. The incomplete sequence is interpreted as ISO8859-1 and added to the string.

 % set s [encoding convertfrom utf-8 [string range [encoding convertto utf-8 ÄÖ] 0 end-1]]
 ÄÃ
 % scan $s %c%c
 196 195

The first character value 196 is the correct "Ä" character. The second character is the verbatim byte of the incomplete utf-8 sequence:

 % scan [encoding convertto utf-8 Ö] %c%c
 195 150

Use case 1: decode continuous multi-byte data

My personal use-case is a stream of UTF-8 data which is received by a USB character driver and the binary data is cut in 64 byte chunks. The stream is continuous and I want to decode the received data. If a UTF-8 byte is received partly, a false byte is created and the next chunk decoding does not work, as it starts with a part of a UTF-8 multibyte sequence.

It would be great to know, where the error is to stop the sequence.

Here is a code snipped with the current implementation:

% catch {encoding convertfrom -stoponerror utf-8 [string range [encoding convertto utf-8 ÄÖ] 0 end-1]} e d
1
% set d
    -code 1\
    -level 0
    -errorstack {INNER {invokeStk1 ::tcl::encoding::convertfrom -stoponerror utf-8 Ã\x84Ã }}
    -errorcode {TCL ENCODING STOPONERROR 2}
    -errorinfo {unexpected byte sequence starting at index 2: '\xC3'
        while executing
        "encoding convertfrom -stoponerror utf-8 [string range [encoding convertto utf-8 ÄÖ] 0 end-1]"} -errorline 1

Note: "\0x84" was replaced for the control character "IND" for visibility in the stack trace

Note: "C3" is the hexadecimal representation for decimal 195. Thus, "\xC3" is the first byte of the utf-8 representation of "Ö".

With this info, the -errorcode may be catched by a try clause and the error byte location (2) may be catched (see discussion section for an example). The data before may by passed again to encoding convertfrom which is the correct received data.

Note: it would be efficient, if the already converted string may be returned also. Then, the data must not be passed again to encoding convertfrom. The current implementation does not fullfill this optimisation.

Use case 2:

This use case is given in the following TCL ticket: Ticket 535705 :

Wrong characters are included in a data base by character replacement when a character not in the current system encoding. This causes issues in a multi-platform applications, as the error is not detected.

Specification

New Option

The encoding ensemble will be extended by two new options -nothrow and -stoponerror:

encoding convertfrom ?-nothrow|-stoponerror? ?encoding? data
encoding convertto ?-nothrow|-stoponerror|? ?encoding? data

The specified options have the following functionality:

Options Description
-nothrow Any conversion data error is handled by replacement (by ? or U+FFFD) of the concerned byte
-stoponerror The conversion stops with an error on any conversion data error

The default value is -nothrow for TCL 8.x and -stoponerror for TCL 9.x.

Error reporting

If -stoponerror is active, the following error reporting takes place in case of an conversion error.

Definition of "error position"

The position of the error in the source string is indicated in the error reporting. In case of multi-byte source data, this position is always one byte after the last correct multi-byte sequence.

Error Message

The error message is: "unexpected byte sequence starting at index error position: 'byte value'", for encoding convertfrom or "unexpected character at index error position: 'character value'", for encoding convertto where error position is a number containing the source string error position as defined above.

byte value/character value is the hexadecimal representation of the byte in the source string where error position points to.

Error Code

The error code is composed of the following 4 list elements:

  1. Fix word: TCL
  2. Fix word: ENCODING
  3. Fix word: STOPONERROR
  4. Value error position: The index in the source string (usually a byte array, in case of encoding convertfrom) of the error position.

New C API

Introduce 2 new functions:

These functions behave the same as Tcl_ExternalToUtfDString/Tcl_UtfToExternalDString, only they have an additional flags parameter accepting the following additional values (can be used in combination):

The TCL_ENCODING_MODIFIED flag can be used in extensions for generating "modified" encodings, such as java (which uses "modified" cesu-8 internally). This flag is not exposed at script level, unlike -stoponerror/-nothrow.

The return value of these two functions is the error-position in case of an error reporting, or (size_t)-1 if everything is OK.

Discussion

Ticket 535705

This TIP started in the TCL ticket 535705. Please refer to this ticket for information about the initial discussion.

Option design

The requirement to have two ortogonal flags is due to the fact, that it was proposed to change the default value from -nothrow to -stoponerror with TCL9. Otherwise, it is not possible to write code for TCL8 and TCL9. The ortogonal option -nothrowis not required if this default change is not seen as a possibility (it may be a vote option).

Error reporting design

The list of categories for the error code return is given in the tclvars manual page. The TCL category matches best.

This design allows to catch this error and get the error position by the following try pattern:

try {
    set res [encoding convertto -stoponerror iso8859-1 $input]
} trap {TCL ENCODING STOPONERROR} {errorMessage errorDict} {
    set errorIndex [lindex [dict get $errorDict -errorcode] 3]
    ...
}

Rejected alternatives

Report the error character

The original implementation reported the failing character/byte in the error message. This may be a control character corrupting a terminal view. IMHO error messages should be in the printable ASCII character set. Therefore this was removed, the character/byte is now only reported in hexadecimal notation.

--stoponerror as boolean option

An alternative would be to use -stoponerror 1 in stead of -stoponerror and -stoponerror 0 in stead of -nothrow. This only makes the command longer, without much benefit.

EILSEQ POSIX error code

Recent changes to TCL use the POSIX error EILSEQ: "invalid byte sequence", which looks like the most appropriate error message. Nevertheless, the POSIX message does not allow to return the error position.

Alternate solutions

See [607], which is not actually an alternative, but can be added later on top of this TIP.

Implementation

Implementation is in Tcl branch encodings-with-flags

Compatibility

The implementation is fully backward compatible for 8.7. There is a compatibility break for TCL 9.0.

Credits

Thanks to Jan Nijtmans for idea and implementation.

Copyright

This document has been placed in the public domain.