TIP 633: configure channel about channel encoding error reporting mode

Author:		Harald Oehlmann <[email protected]>
State:		Draft
Type:		Project
Vote:		In progress
Created:	12-Aug-2022
Tcl-Version:	8.7
Tcl-Branch:	tip633-tcl9-fconfigure-strictencoding
Keywords:	channel encoding

Abstract

Allow each channel to be switched between two ways of handling data encoding issues: throwing an error, or recovering by modifying the data.

Rationale

Encoding errors may arise when channel data is read or written. There are two ways to handle them:

Tcl up to version 8.7 modifies the data and does not throw an error. Tcl starting from version 9 throws an error on encoding issues.

Both approaches have their use cases. The purpose of this TIP is to make it possible to switch between them.

This is in line with the extension of encoding convertto/convertfrom by TIP 601, which added an option -nocomplain to activate data modification mode.

Error types

There are three types of possible errors, exercised in tests io-75.1 to io-75.10:

Invalid multi byte sequence read

The example uses the UTF-8 byte "0xC0", which announces a multi-byte sequence and requires a following continuation byte in the range "0x80" to "0xBF". Tests 75.1 and 75.6 use the invalid sequence "\xC0\x40". It is written to a file and read back with utf-8 encoding.

When modified data mode is active (-nocomplain), the data is returned as byte data "\xC0\x40" (test 75.1). When error throwing mode (default mode) is active, an error is thrown (test 75.6).

Tests 75.4 and 75.9 also exercise this case with shiftjis encoding.
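The read case above can be sketched as follows. This is only an illustration, not the code of the tests: it assumes an interpreter built from the TIP branch, where fconfigure accepts -nocomplainencoding. The helper name readWithMode is hypothetical. The fconfigure call sits inside catch, so on an interpreter without the option the sketch still runs and simply reports the configuration error.

```tcl
# Hypothetical helper: read a whole file as utf-8 in the given mode.
# Returns a two-element list: the catch status (0 = ok, 1 = error)
# and either the data read or the error message.
proc readWithMode {path nocomplain} {
    set f [open $path r]
    set status [catch {
        # -nocomplainencoding is the option proposed by this TIP (assumption:
        # the interpreter was built from the TIP branch).
        fconfigure $f -encoding utf-8 -nocomplainencoding $nocomplain
        read $f
    } result]
    close $f
    return [list $status $result]
}

# Write the invalid sequence \xC0\x40 as raw bytes to a temporary file.
set f [file tempfile path]
fconfigure $f -translation binary
puts -nonewline $f \xC0\x40
close $f

puts [readWithMode $path 1]   ;# modify data mode
puts [readWithMode $path 0]   ;# throw error mode
file delete $path
```

Returning the catch status next to the result lets a caller tell an encoding error from modified data without parsing error messages.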

Unrepresentable character write

A character that the current encoding cannot represent is written to a file. The example in tests io-75.2 and io-75.7 writes "\u2022" to an iso8859-1 encoded channel, which does not allow any Unicode code points above 0xFF.

When modified data mode is active (-nocomplain), a question mark (?) is written. When error throwing mode (default mode) is active, an error is thrown (test 75.7).
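The write case can be sketched in the same spirit. As before, this assumes an interpreter built from the TIP branch. The helper name writeWithMode is hypothetical. The flush is there so that the write reaches the file inside the catch. The sketch does not claim at which call the error surfaces.

```tcl
# Hypothetical helper: write one string to an iso8859-1 encoded file
# in the given mode. Returns the catch status (0 = ok, 1 = error)
# and the result or error message.
proc writeWithMode {path text nocomplain} {
    set f [open $path w]
    set status [catch {
        # -nocomplainencoding is the option proposed by this TIP (assumption:
        # the interpreter was built from the TIP branch).
        fconfigure $f -encoding iso8859-1 -nocomplainencoding $nocomplain
        puts -nonewline $f $text
        flush $f
    } result]
    # close may report the same failure again, so its error is ignored here
    catch {close $f}
    return [list $status $result]
}

set f [file tempfile path]
close $f

puts [writeWithMode $path \u2022 1]   ;# modify data mode
puts [writeWithMode $path \u2022 0]   ;# throw error mode
file delete $path
```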

Incomplete multi byte sequence read

The example uses the UTF-8 byte "0xC0", which announces a multi-byte sequence, as the last byte of the file. Tests 75.3 and 75.8 use the sequence "\xC0" at the end of the file and read it with utf-8 encoding.

When modified data mode is active (-nocomplain), the data is returned as byte data "\xC0" (test 75.3). When error throwing mode (default mode) is active, an error is thrown (test 75.8).

Tests 75.5 and 75.10 also exercise this case with shiftjis encoding.

Specification

Extend the channel configuration command fconfigure by the following item: -nocomplainencoding bool. The argument bool is a boolean value: 0 selects throw error mode and 1 selects modify data mode.

The default value is "1" for Tcl 8.7 and "0" for Tcl 9.0.

The option is added to both Tcl 8.7 and Tcl 9.0. In Tcl 8.7, it is an error to set the value to false.

Example

fconfigure $handle -encoding utf-8 -nocomplainencoding 1

Implementation

Implementations started for the two branches:

Discussion

Discussion took place at the Vienna Tcl meeting and in the August Tcl conference call. Four Tcl wizards expressed the need for this functionality in those meetings.

The name "-nocomplainencoding" is taken from the -nocomplain option of encoding convertfrom/to.

Alternate script-only solution

Jan wrote on the core list 2022-09-15 "Re:TIP#346 utf8-strict":

Well, there is another way to get the 8.x behavior back in Tcl 9.0.

Suppose we have a channel $f from which the byte \x80 is available. Consider the following code:

     fconfigure $f -encoding ascii
     read $f

this will result in '?' on Tcl 8.x, it will throw an exception in Tcl 9.0 using your proposal:

      fconfigure $f -nocomplainencoding 0 -encoding ascii
      read $f

But there's another way, which already works now:

      fconfigure $f -encoding binary
      encoding convertfrom -nocomplain ascii [read $f]

So, instead of adding "-nocomplainencoding" to the channel, just set the channel in binary mode and use the "-nocomplain" option from "encoding convertfrom". There is little added value, building "-nocomplainencoding" into the channel code, when there is an alternative.

(end of quote).

This is a valid point. However, the TIP also offers symmetry and ease of understanding for the script programmer. There are other use cases, such as the fcopy command, where there is no easy replacement. There may also be advantages for stacked channels, pipes, or other applications of the channel system. IMHO, the TCT may decide whether this TIP should be implemented.
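The fcopy use case can be sketched as follows. With fcopy the script never sees the data, so there is no place to call encoding convertfrom -nocomplain. The sketch assumes an interpreter built from the TIP branch. The helper name recodeFile is hypothetical.

```tcl
# Hypothetical helper: recode a file from utf-8 to iso8859-1 with fcopy.
# Returns the catch status (0 = ok, 1 = error) and the fcopy result
# or the error message.
proc recodeFile {src dst nocomplain} {
    set in  [open $src r]
    set out [open $dst w]
    set status [catch {
        # -nocomplainencoding is the option proposed by this TIP (assumption:
        # the interpreter was built from the TIP branch).
        fconfigure $in  -encoding utf-8     -nocomplainencoding $nocomplain
        fconfigure $out -encoding iso8859-1 -nocomplainencoding $nocomplain
        fcopy $in $out
    } result]
    catch {close $in}
    catch {close $out}
    return [list $status $result]
}

# Source file containing a character that iso8859-1 cannot represent.
set f [file tempfile src]
fconfigure $f -encoding utf-8
puts $f "bullet: \u2022"
close $f
set dst $src.latin1

puts [recodeFile $src $dst 1]   ;# modify data mode
puts [recodeFile $src $dst 0]   ;# throw error mode
file delete $src $dst
```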

Ashok added the following remark on the core list on 2022-09-16:

This might be fine when reading the entire channel content but is not convenient in streaming mode.

A minor nit - something like -looseencoding might be a better name for -nocomplainencoding.

TIP 346: Error on Failed String Encodings

TIP 346 by Alexandre Ferrieux is (in its current modified form) independent of this TIP.

TIP 346 defines a strict mode, which determines whether some compatibility cases are treated as encoding errors. This TIP determines what happens on encoding errors: error reporting or data modification. I suppose that TIP 346 strict mode does not make sense in combination with no error reporting. The strict mode is more of a tool that may evolve in the future.

Copyright

This document has been placed in the public domain.