TIP 607: -failindex option for encoding convertto/convertfrom

Author:         Jan Nijtmans <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Created:        30-Jul-2021
Post-History:
Keywords:       Tcl
Tcl-Version:    8.7
Tcl-Branch:     tip607-encoding-failindex
Vote-Summary:   Accepted 5/0/0
Votes-For:      AK, FV, JN, KW, SL
Votes-Against:  none
Votes-Present:  none

Abstract

This TIP proposes adding a -failindex option to encoding convertto and encoding convertfrom. The implementation brings TIP [601] to the script level: when data cannot be converted, the error location and the string converted so far can be retrieved.

Rationale

Please refer to TIP [601] for usage examples and use-cases. This TIP was extracted from TIP [601], and its rationale and many of its descriptions also apply here.

Note that the wish in use-case 1, to return the data encoded so far, is fulfilled by this TIP.

Option name

The option name -failindex is inspired by the -failindex option of the string is command, which has similar functionality.

Distinguish between error types "incomplete multi-byte sequence" and "not encodable character"

See TIP [601] Example 1 and 2 for the explanation of the two error types.

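The two error types are bound to the command used: encoding convertfrom reports "incomplete multi-byte sequence" and encoding convertto reports "not encodable character".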

Consequently, no feedback about the error type is required. The error position is sufficient.

Tcl 8.7 and Tcl 9.0

In Tcl 8.7, this interface is the only way to be informed of encoding errors.

In Tcl 9.0, the default behaviour is to fail on any encoding error. This interface may therefore also help to prepare Tcl 8.7 scripts for Tcl 9.0, by checking where Tcl 9.0 would fail.
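As a minimal sketch of such a check, the following helper reports where decoding would stop. The proc name decodeFailureOffset and the sample data are hypothetical, and the sketch assumes that the variable named by -failindex is set to -1 when no error occurs.

```tcl
# Hypothetical helper (not part of this TIP): return the byte offset at
# which decoding $bytes with $encName stops, or -1 (the assumed
# "no error" value) if everything decodes.
proc decodeFailureOffset {encName bytes} {
    encoding convertfrom -failindex pos $encName $bytes
    return $pos
}

# Sample input: valid UTF-8 followed by \xFF, a byte that never
# occurs in UTF-8.
set sample "caf\xC3\xA9 \xFF tail"
set offset [decodeFailureOffset utf-8 $sample]
if {$offset >= 0} {
    puts "decoding would fail at byte offset $offset"
} else {
    puts "data decodes cleanly"
}
```

Running such a check over the inputs of an existing script is one way to find data that a stricter decoder would reject, without changing how the script currently behaves.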

Specification

New option

The encoding convertfrom and encoding convertto commands are each extended by a -failindex option:

encoding convertfrom ?-failindex posvar? ?encoding? data
encoding convertto ?-failindex posvar? ?encoding? data

The distinct behaviour is as follows:

The definition of the value written by the -failindex option is given in TIP [601] as "Error position".
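The encoding direction can be sketched as follows. The sample text and variable names are illustrative only, and the sketch again assumes that posvar receives -1 when the whole input is converted.

```tcl
# The euro sign (U+20AC) has no representation in ISO 8859-1, so the
# conversion stops there instead of raising an error or substituting
# a replacement character.
set text "price: 10\u20AC net"
set encoded [encoding convertto -failindex pos iso8859-1 $text]

if {$pos >= 0} {
    # $encoded holds only the part converted before the failure;
    # $pos is a character index into the input string.
    set rest [string range $text $pos end]
    puts "stopped at character index $pos"
    puts "converted so far: $encoded"
    puts "not converted: $rest"
} else {
    puts "fully converted: $encoded"
}
```

For encoding convertto the position is counted in characters of the input string, whereas for encoding convertfrom it is counted in bytes of the input data.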

This option may not be used together with the -nocomplain option of TIP [601]. Any attempt to use -nocomplain and -failindex together is an error.

Credits

The proposal was initiated by a post from Andreas Leitgeb on the core list on 2021-05-12.

Discussion after vote

Error types

Harald Oehlmann 2023-01-13: The statement above, that the error types depend on the conversion direction, is wrong. The command encoding convertfrom can produce both error types, as the following examples show:

Example for 'Not convertible character'

 encoding convertfrom -failindex Pos utf-8 A\xC4\x01Z

The byte \xC4 announces a multi-byte sequence. The following byte must be a continuation byte above \x7F (in the range \x80 to \xBF), which is not the case here.

The command will return 'A' with value 1 in variable 'Pos'.

Example for 'Incomplete multi-byte sequence'

 encoding convertfrom -failindex Pos utf-8 A\xC4

The byte \xC4 announces a multi-byte sequence, but nothing follows it.

The command will return 'A' with value 1 in variable 'Pos'.

It would be really helpful if the two errors could be distinguished from the return value. The underlying C routines know the difference. It is just not exposed at the script level.

Note that this issue is not present in the channel interface. Channels always buffer incomplete sequences and never return partial data.
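As long as the error type is not exposed, a script can only approximate the distinction by inspecting the bytes at the reported position. The following is a rough sketch for UTF-8 input only. The proc name classifyUtf8Failure and its return values are hypothetical, the check is a heuristic rather than a full UTF-8 validator, and it assumes that -failindex stores -1 when no error occurs.

```tcl
# Hypothetical helper: classify the result of decoding $data as UTF-8.
# Returns "ok", "incomplete" (input ends inside a multi-byte sequence)
# or "invalid" (a byte that cannot be part of the sequence).
proc classifyUtf8Failure {data} {
    encoding convertfrom -failindex pos utf-8 $data
    if {$pos < 0} {
        return ok
    }
    # Determine the expected sequence length from the lead byte.
    binary scan [string index $data $pos] cu lead
    if {$lead >= 0xC2 && $lead <= 0xDF} {
        set need 2
    } elseif {$lead >= 0xE0 && $lead <= 0xEF} {
        set need 3
    } elseif {$lead >= 0xF0 && $lead <= 0xF4} {
        set need 4
    } else {
        return invalid
    }
    # Enough bytes follow, so the sequence itself must be malformed.
    set tail [string range $data $pos+1 end]
    if {[string length $tail] >= $need - 1} {
        return invalid
    }
    # Too few bytes follow: incomplete only if all are continuation bytes.
    set rest {}
    binary scan $tail cu* rest
    foreach b $rest {
        if {$b < 0x80 || $b > 0xBF} {
            return invalid
        }
    }
    return incomplete
}

puts [classifyUtf8Failure "A\xC4\x01Z"]
puts [classifyUtf8Failure "A\xC4"]
```

Both calls stop at the same position, so the classification comes entirely from the extra byte inspection, which is exactly the information the C routines already have.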

Copyright

This document has been placed in the public domain.