Author: Jan Nijtmans <[email protected]>
State: Final
Type: Project
Vote: Done
Created: 30-Jul-2021
Post-History:
Keywords: Tcl
Tcl-Version: 8.7
Tcl-Branch: tip607-encoding-failindex
Vote-Summary: Accepted 5/0/0
Votes-For: AK, FV, JN, KW, SL
Votes-Against: none
Votes-Present: none
Abstract
This TIP proposes adding a -failindex option to encoding convertto and encoding convertfrom. The implementation brings TIP [601] to the script level. When the data cannot be converted, the caller can retrieve both the error location and the string converted so far.
Rationale
Please refer to TIP [601] for usage examples and use cases. This TIP was extracted from TIP [601], so the rationale and many of the descriptions given there also apply here.
Note that the wish in use-case 1 to return the data encoded so far is fulfilled by this TIP.
Option name
The option name -failindex is inspired by the string is command, which has an option of the same name with similar functionality.
Distinguish between error types "incomplete multi-byte sequence" and "not encodable character"
See TIP [601] Example 1 and 2 for the explanation of the two error types.
The two error types are tied to the command used:
- "incomplete multi-byte sequence" may only appear in encoding convertfrom
- "not encodable character" may only appear in encoding convertto
Consequently, no feedback about the error type is required. The error position is sufficient.
Tcl 8.7 and Tcl 9.0
In Tcl 8.7, this interface is the only way to be informed of encoding errors.
In Tcl 9.0, the default behaviour is to fail on any encoding error. This interface can therefore also help to prepare Tcl 8.7 scripts for Tcl 9.0 by showing where Tcl 9.0 would fail.
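As an illustration of such a migration check, the following minimal sketch reports the first position at which a decode would fail. It assumes a Tcl build that implements this TIP. The proc name firstDecodeError and the sample payload are hypothetical and not part of the proposal.

```tcl
# Hypothetical helper: return the index of the first conversion error
# in "bytes" for the given encoding, or -1 if the data converts cleanly.
# Requires an interpreter that supports the -failindex option.
proc firstDecodeError {encodingName bytes} {
    # The converted text is discarded; only the error position is of interest.
    encoding convertfrom -failindex pos $encodingName $bytes
    return $pos
}

# Sample payload: "A" followed by a truncated two-byte utf-8 sequence.
set payload "A\xC4"
set pos [firstDecodeError utf-8 $payload]
if {$pos >= 0} {
    puts "conversion would fail at byte index $pos"
} else {
    puts "payload converts without error"
}
```

The encoding name is always passed explicitly here, so the sketch does not rely on the optional ?encoding? argument.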
Specification
New option
Both commands are extended by a -failindex option:
encoding convertfrom ?-failindex posvar? ?encoding? data
encoding convertto ?-failindex posvar? ?encoding? data
The behaviour in each case is as follows:
- No conversion error
    - Option -failindex not given: The converted data is returned as the command result.
    - Option -failindex given: Additionally, the value -1 is written to the given variable in the caller's scope.
- Conversion error present
    - Option -failindex not given: In Tcl 8.7, or in Tcl 9.0 with the -nocomplain option, the data is converted with replacement characters, as is currently done. Otherwise, the command raises an error (error code EILSEQ; see TIP [601]).
    - Option -failindex given: The data converted up to the failing index is returned as the command result. The position of the conversion error in the source string is written to the specified variable in the caller's scope.
The definition of the value written by the -failindex option is given in TIP [601] as "Error position".
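The cases with -failindex given can be sketched as follows. This is an illustration only, assuming a Tcl build that implements this TIP; the variable names text, bytes and pos and the sample data are chosen for the example.

```tcl
# No conversion error: the full result is returned and pos is set to -1.
set text [encoding convertfrom -failindex pos utf-8 "A\xC3\xA4Z"]
puts "decoded [string length $text] characters, pos=$pos"

# Conversion error: \xC4 starts a two-byte sequence, but \x01 is not a
# valid continuation byte. Only the data before the error is returned,
# and pos receives the index of the offending byte in the source.
set text [encoding convertfrom -failindex pos utf-8 "A\xC4\x01Z"]
puts "decoded \"$text\", pos=$pos"

# The same option on convertto: the Euro sign (U+20AC) has no
# representation in iso8859-1, so the conversion should stop at the
# index of that character.
set bytes [encoding convertto -failindex pos iso8859-1 "A\u20ACZ"]
puts "encoded [string length $bytes] bytes, pos=$pos"
```

Because a successful conversion always stores -1, a script can test for failure with a single comparison such as {$pos >= 0}.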
This option may not be used together with the -nocomplain option of TIP [601]. Any attempt to use -nocomplain and -failindex simultaneously is an error.
Credits
The proposal was initiated by a post from Andreas Leitgeb on the core list on 2021-05-12.
Discussion after vote
Error types
Harald Oehlmann 2023-01-13: The statement above that the error types depend on the conversion direction is wrong. The command encoding convertfrom can produce both error types, as the following examples show:
Example for 'Not convertible character'
encoding convertfrom -failindex Pos utf-8 A\xC4\x01Z
The byte 0xC4 announces a multi-byte sequence. The following byte must be a continuation byte (in the range 0x80 to 0xBF), which is not the case here.
The command will return 'A' with value 1 in variable 'Pos'.
Example for 'Incomplete multi-byte sequence'
encoding convertfrom -failindex Pos utf-8 A\xC4
The byte 0xC4 announces a multi-byte sequence, but nothing follows it.
The command will return 'A' with value 1 in variable 'Pos'.
It would be really helpful if the two errors could be distinguished from the return value. The underlying C routines know the difference; it is just not exposed at the script level.
Note that this issue is not present in the channel interface. Channels always buffer incomplete sequences and never return partial data.
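Until the error type is exposed, a script could approximate the distinction for utf-8 by inspecting the bytes at the reported position. The following is a heuristic sketch based on the standard UTF-8 lead-byte rules, not on the behaviour of the Tcl C routines. The proc name utf8FailKind is hypothetical and not part of this TIP.

```tcl
# Hypothetical helper: guess whether a utf-8 decoding failure at index
# "pos" of the byte string "data" is an incomplete sequence at the end
# of the data or an invalid byte sequence. Returns "incomplete" or "invalid".
proc utf8FailKind {data pos} {
    # Numeric value of the byte at the reported error position.
    scan [string index $data $pos] %c lead
    # Sequence length announced by the lead byte (standard UTF-8 ranges).
    if {$lead >= 0xC2 && $lead <= 0xDF} {
        set need 2
    } elseif {$lead >= 0xE0 && $lead <= 0xEF} {
        set need 3
    } elseif {$lead >= 0xF0 && $lead <= 0xF4} {
        set need 4
    } else {
        # Not a valid lead byte at all.
        return invalid
    }
    set len [string length $data]
    # Enough bytes are present, so the sequence itself must be malformed.
    if {$len - $pos >= $need} {
        return invalid
    }
    # Too few bytes: incomplete only if every remaining byte is a
    # continuation byte (0x80 to 0xBF).
    for {set i [expr {$pos + 1}]} {$i < $len} {incr i} {
        scan [string index $data $i] %c b
        if {$b < 0x80 || $b > 0xBF} {
            return invalid
        }
    }
    return incomplete
}

# The two examples above, with Pos = 1 as reported by -failindex.
puts [utf8FailKind "A\xC4\x01Z" 1]
puts [utf8FailKind "A\xC4" 1]
```

The helper looks only at the raw bytes, so it may disagree with the decoder in corner cases such as overlong or surrogate encodings.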
Copyright
This document has been placed in the public domain.