Author: Harald Oehlmann <[email protected]>
State: Final
Type: Project
Vote: Done
Created: 12-Aug-2022
Tcl-Version: 8.7
Tcl-Branch: tip633-tcl9-fconfigure-strictencoding
Keywords: channel encoding
Vote-Summary: Accepted 4/0/1
Votes-For: JN, KW, MC, SL
Votes-Against: none
Votes-Present: FV
Abstract
Allow a script to choose, per channel, whether data encoding issues throw an error or are recovered from by modifying the data.
Rationale
Encoding errors may arise when data is read from or written to a channel. There are two ways to handle them:
- Throw an error on the corresponding commands
- Modify the data to continue without error.
Tcl up to and including version 8.7 modifies the data and does not throw an error. Tcl starting from version 9 will throw an error on encoding issues.
Both approaches have their use cases. The purpose of this TIP is to make it possible to switch between the two possibilities.
This is in line with the extension of encoding convertto/convertfrom by TIP 601, which added the option -nocomplain to activate data modification mode.
Error types
There are three types of possible errors, exercised in tests io-75.1 to io-75.10:
Invalid multi byte sequence read
The example uses the UTF-8 byte "\xC0", which announces a multi-byte sequence and requires a following byte in the range "\x80" to "\xBF". Tests 75.1/75.6 use the invalid sequence "\xC0\x40". It is written to a file and read back with utf-8 encoding.
When modified data mode is active (-nocomplain), the data is returned as byte data "\xC0\x40" (test 75.1). When error throwing mode (default mode) is active, an error is thrown (test 75.6).
Tests 75.4 and 75.9 also exercise this case with shiftjis encoding.
Unrepresentable character write
A character that cannot be represented in the current encoding is written to a file. The example in tests io-75.2/75.7 writes "\u2022" to an iso8859-1 encoded channel, which cannot represent any Unicode code point above 0xFF.
When modified data mode is active (-nocomplain), a question mark (?) is written (test 75.2). When error throwing mode (default mode) is active, an error is thrown (test 75.7).
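The write case can be probed from a script along the following lines. This is only a sketch: the helper name probeWrite and the scratch file name are hypothetical, and the script deliberately avoids the new option, so the result reflects whatever default the running interpreter uses.

```tcl
# Hypothetical helper (not part of the TIP): write $text to a scratch file
# using $encoding, then report whether an error was raised and which bytes
# actually reached the file.
proc probeWrite {text encoding} {
    set path [file join [pwd] tip633-write-probe.tmp]  ;# hypothetical file name
    set f [open $path w]
    fconfigure $f -encoding $encoding -translation lf

    # The error may surface on puts or only when the buffer is flushed,
    # so both are covered by the same catch.
    set failed [catch {
        puts -nonewline $f $text
        flush $f
    } msg]
    catch {close $f}

    # Read the file back in binary mode and render its bytes as hex.
    set f [open $path rb]
    set bytes [read $f]
    close $f
    file delete $path
    set hex ""
    binary scan $bytes H* hex

    return [list [expr {$failed ? "error" : "ok"}] $hex]
}

# U+2022 (bullet) has no representation in iso8859-1.
puts [probeWrite "\u2022" iso8859-1]
```

In modify data mode the first list element is expected to be ok and the hex dump should show the replacement character; in error throwing mode the first element is expected to be error.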
Incomplete multi byte sequence read
The example uses the UTF-8 byte "\xC0", which announces a multi-byte sequence, as the last byte of the data. Tests 75.3/75.8 use the sequence "\xC0" at the end of a file and read it with utf-8 encoding.
When modified data mode is active (-nocomplain), the data is returned as byte data "\xC0" (test 75.3). When error throwing mode (default mode) is active, an error is thrown (test 75.8).
Tests 75.5 and 75.10 also exercise this case with shiftjis encoding.
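The two read cases above can be reproduced with a small script. The following is a minimal sketch, not taken from the test suite: the helper name probeRead and the scratch file name are hypothetical, and since the new option is not used, the outcome depends on the default mode of the running interpreter.

```tcl
# Hypothetical helper (not part of the TIP): write raw bytes to a scratch
# file, read them back with the given encoding and report what happened.
proc probeRead {bytes encoding} {
    set path [file join [pwd] tip633-read-probe.tmp]  ;# hypothetical file name
    set f [open $path wb]
    puts -nonewline $f $bytes
    close $f

    set f [open $path r]
    fconfigure $f -encoding $encoding -translation lf
    # catch returns 0 on success and 1 when read raises an error.
    set failed [catch {read $f} result]
    close $f
    file delete $path

    return [list [expr {$failed ? "error" : "ok"}] $result]
}

puts [probeRead "\xC0\x40" utf-8]  ;# invalid multi-byte sequence
puts [probeRead "\xC0" utf-8]      ;# incomplete sequence at end of file
```

The first list element distinguishes the two modes: ok when the data was modified and returned, error when the read command threw.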
Specification
Extend the channel configuration command fconfigure by the following item: -nocomplainencoding bool. Bool is a boolean value: 0 selects throw error mode and 1 selects modify data mode.
The default value is "1" for Tcl 8.7 and "0" for Tcl 9.0.
The option is added to Tcl 8.7 and Tcl 9.0. In Tcl 8.7, it is an error to set the value to false.
Example
fconfigure $handle -encoding utf-8 -nocomplainencoding 1
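A script that has to run on interpreters with and without this option could guard the call. The sketch below assumes only that an interpreter lacking the option rejects it with an error, as fconfigure does for unknown options; the helper name trySetNoComplain is hypothetical.

```tcl
# Hypothetical helper (not part of the TIP): try to select the encoding
# error mode on a channel. Returns 1 if the interpreter accepted the
# option, 0 if it rejected it (unknown option or disallowed value).
proc trySetNoComplain {chan enabled} {
    expr {![catch {fconfigure $chan -nocomplainencoding $enabled}]}
}

if {[trySetNoComplain stdout 1]} {
    puts "modify data mode selected on stdout"
} else {
    puts "-nocomplainencoding not accepted by this interpreter"
}
```

Note that a return value of 0 does not tell the two failure reasons apart; on Tcl 8.7, for instance, setting the value to false is specified to be an error even though the option exists.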
Implementation
Implementation has started on two branches:
- Tcl 8.7: tip633-fconfigure-tolerantencoding
- Tcl 9.0: tip633-tcl9-fconfigure-strictencoding
Discussion
Discussion took place at the Vienna Tcl meeting and at the August Tcl conference call. In those meetings, four Tcl wizards expressed the need for this functionality.
The name "-nocomplainencoding" is taken from the -nocomplain option of encoding convertfrom/to.
Alternate script-only solution
Jan wrote on the core list 2022-09-15 "Re:TIP#346 utf8-strict":
Well, there is another way to get the 8.x behavior back in Tcl 9.0.
Suppose we have a channel $f from which the byte \x80 is available. Consider the following code:
fconfigure $f -encoding ascii
read $f
this will result in '?' on Tcl 8.x, it will throw an exception in Tcl 9.0 using your proposal:
fconfigure $f -nocomplainencoding 0 -encoding ascii
read $f
But there's another way, which already works now:
fconfigure $f -encoding binary
encoding convertfrom -nocomplain ascii [read $f]
So, instead of adding "-nocomplainencoding" to the channel, just set the channel in binary mode and use the "-nocomplain" option from "encoding convertfrom". There is little added value, building "-nocomplainencoding" into the channel code, when there is an alternative.
(end of quote).
This is a perfectly valid approach. However, the TIP also offers symmetry and ease of understanding to the script programmer. There are other use cases, such as the fcopy command, where there is no easy replacement. There might also be advantages for stacked channels, pipes or other applications of the channel system. IMHO, the TCT may decide whether this TIP should be implemented.
Ashok added the following remark on the core list on 2022-09-16:
This might be fine when reading the entire channel content but is not convenient in streaming mode.
A minor nit - something like -looseencoding might be a better name for -nocomplainencoding.
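One possible reading of the streaming concern is that binary chunks converted independently may split a multi-byte sequence. The following sketch illustrates this under that assumption; the chunk boundaries are chosen by hand for illustration and the script uses only encoding convertfrom without any new option.

```tcl
# UTF-8 bytes of U+2022 (bullet): E2 80 A2
set bytes "\xE2\x80\xA2"

# Converting the complete sequence gives the expected character.
set whole [encoding convertfrom utf-8 $bytes]

# Hypothetical streaming loop that converts binary chunks independently;
# the chunk boundary falls inside the multi-byte sequence.
set chunked ""
set failed 0
foreach chunk [list [string range $bytes 0 1] [string range $bytes 2 2]] {
    if {[catch {encoding convertfrom utf-8 $chunk} part]} {
        set failed 1  ;# an interpreter in error throwing mode rejects the partial sequence
        break
    }
    append chunked $part
}

puts "whole sequence decoded correctly: [string equal $whole \u2022]"
puts "chunked result matches: [expr {!$failed && $chunked eq $whole}]"
```

A script-level replacement for the channel option would therefore have to buffer incomplete sequences across chunks itself, whereas the channel does that buffering internally.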
TIP 346: Error on Failed String Encodings
TIP 346 by Alexandre Ferrieux is (in its current modified form) independent of this TIP.
TIP 346 defines a strict mode, which determines whether some compatibility cases are treated as encoding errors. This TIP specifies what happens on encoding errors: error reporting or data modification. I suppose that the TIP 346 strict mode does not make sense in combination with no error reporting. The strict mode is more a tool that may evolve in the future.
Copyright
This document has been placed in the public domain.