TIP 653: Handling encoding errors for read and gets

Author:		Nathan Coulter
State:		Draft
Type:		Project
Vote:		Pending
Created:	08-Jan-2023
Tcl-Version:	8.7
Tcl-branch:	tip-653
Vote-Summary:	
Votes-For:	
Votes-Against:	
Votes-Present:	

Abstract

In recent versions of Tcl it is possible to configure a channel to treat data that does not conform to the encoding specification as an error. read and gets must pass along such an error. In order to maintain expected semantics for a blocking channel, and to maximize utility, read and gets can use the return options dictionary to communicate additional information when an error occurs.

Specification

When read and gets must return an encoding error on a blocking channel, the error is returned when it is encountered, not on some subsequent call. The data successfully decoded up to the point of the error is available in the return options dictionary under either the path -result read or the key -data (to be determined). The advantage of -result read is that additional information could be added under the same key. For example, -result bytes might contain the original bytes prior to decoding.

After such an error, the current access position in the channel is the position of the first byte of the data that caused the error. [tell] provides that position.
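The behavior described above can be sketched in a short script. This is a minimal illustration, not a normative part of the proposal: it assumes a Tcl build that implements this TIP and supports the -profile channel option, and because the location of the partial data is still to be determined, it checks both candidate locations. The temporary file and all variable names are illustrative only.

```tcl
# Create a scratch file holding the bytes "A", 0xC3, "B".
# 0xC3 followed by "B" is not a valid UTF-8 sequence.
set out [file tempfile path]
fconfigure $out -translation binary
puts -nonewline $out "A\xC3B"
close $out

set in [open $path r]
fconfigure $in -encoding utf-8 -profile strict -blocking 1

# The error is raised by this call, not by a later one.
set failed [catch {read $in} msg opts]

# The text decoded before the error, wherever the final design puts it.
set partial {}
if {[dict exists $opts -data]} {
    set partial [dict get $opts -data]
} elseif {[dict exists $opts -result read]} {
    set partial [dict get $opts -result read]
}

# The access position is the first byte of the offending data.
set errpos [tell $in]

close $in
file delete $path
```

Under the rules in this section, partial should hold the single character A and errpos should be the offset of the 0xC3 byte.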

Channels which use the (default) strict profile now return the POSIX error EILSEQ when an encoding error occurs. For maximum compatibility with current behavior, a distinction is made between 'blocking' and 'non-blocking' mode.

In 'blocking' mode, the functions Tcl_Read()/Tcl_ReadObj() and Tcl_Gets()/Tcl_GetsObj() set the POSIX error EILSEQ whenever an encoding error occurs. If Tcl_Gets()/Tcl_GetsObj() encounter an encoding error, the file pointer is left at the original position, and the functions return -1. Tcl_Read()/Tcl_ReadObj() store the data decoded so far in the return options dictionary, and the file pointer is left where the encoding error occurred.

In 'non-blocking' mode, all data prior to the first byte that resulted in an encoding error is returned, and the POSIX error is not yet set. On the next call to Tcl_Read()/Tcl_ReadObj(), which normally happens in a loop or in a readable event handler, no data is returned and the POSIX error EILSEQ is set. This makes it possible to handle all data up to the point of the error normally.
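At the script level, the two-step behavior in non-blocking mode might look like the following. This is a sketch under the same assumptions as the rest of this section (a Tcl build with the -profile option and the behavior specified here); the scratch file and variable names are illustrative only.

```tcl
# Create a scratch file holding the bytes "A", 0xC3, "B" (invalid UTF-8).
set out [file tempfile path]
fconfigure $out -translation binary
puts -nonewline $out "A\xC3B"
close $out

set in [open $path r]
fconfigure $in -encoding utf-8 -profile strict -blocking 0

# First call: everything decoded before the bad byte, with no error raised.
set first [read $in]

# Second call: no data is returned and the EILSEQ error is raised.
set failed [catch {read $in} msg opts]
set code [lindex [dict get $opts -errorcode] 1]

close $in
file delete $path
```

A readable event handler written this way processes first as ordinary data and deals with the error only when the following call fails.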

The functions Tcl_Write()/Tcl_WriteObj() and Tcl_Eof() don't depend on blocking mode. Tcl_Write() always writes out as many characters as it can, and always sets the POSIX error EILSEQ when it cannot write more due to an encoding error. Tcl_Eof() only returns true when the channel is at an EOF condition, not when the channel is at an encoding error position.

Rationale

The primary intent is to preserve current semantics of read and gets for a blocking channel: An error occurs immediately when non-conforming data is encountered, not on the next call to read or gets, as was proposed in some other approaches. The second goal is to make the position of the non-conforming data available to the caller. One natural way to do this is to make it the current position so that [tell] can provide it. The question then arises: What to do with the data that has been successfully decoded so far? The simplest and probably best answer is to make it available to the caller in case something useful can be done with it.

In Tcl the return value in case of an error is normally an error message, so the return value is not available for passing other information related to the error to the caller. -errorcode could be used, but it is typically used for classification of the error, and mixing in other types of additional information does not seem like a particularly good idea.

The data successfully decoded so far is stored under the path -result read rather than just -result so that if there later arises a need to return other information, it can be assigned to another key under -result. For example, one idea is that the original undecoded bytes should also be returned. -result could become a common pattern for returning rich data in exceptional cases.

Under this proposal the caller of read and gets can handle each occurrence of non-conforming data and then continue to read data from the channel.
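One way a caller might do this is sketched below. The procedure readReplacingBadBytes is hypothetical and not part of the proposal. It assumes a blocking channel, a Tcl build that implements this TIP and supports the -profile option, and that temporarily switching the channel to iso8859-1 is an acceptable way to step over a single offending byte. Because the location of the partial data is still to be determined, both candidate locations are checked.

```tcl
# Hypothetical helper: read the rest of a blocking channel, substituting
# U+FFFD for every byte that does not conform to the channel encoding.
proc readReplacingBadBytes {chan} {
    set text {}
    while 1 {
        if {![catch {read $chan} data opts]} {
            # Clean read to end of file: done.
            append text $data
            return $text
        }
        if {[lindex [dict get $opts -errorcode] 1] ne "EILSEQ"} {
            # Not an encoding error: pass it on unchanged.
            return -options $opts $data
        }
        # Keep what was decoded before the error.
        if {[dict exists $opts -data]} {
            append text [dict get $opts -data]
        } elseif {[dict exists $opts -result read]} {
            append text [dict get $opts -result read]
        }
        # The access position is at the offending byte. Consume that one
        # byte with a single-byte encoding, then restore the encoding.
        set saved [fconfigure $chan -encoding]
        fconfigure $chan -encoding iso8859-1
        read $chan 1
        fconfigure $chan -encoding $saved
        append text \uFFFD
    }
}

# Usage: a scratch file with two invalid bytes (0xC3 before "B", and 0xFF).
set out [file tempfile path]
fconfigure $out -translation binary
puts -nonewline $out "A\xC3B\xFFC"
close $out

set in [open $path r]
fconfigure $in -encoding utf-8 -profile strict -blocking 1
set recovered [readReplacingBadBytes $in]
close $in
file delete $path
```

Replacing the bad byte with U+FFFD is only one policy; a caller could equally log the position reported by tell, or stop at the first error and keep the partial data.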

Implementation

The py-b8f575aa23 branch contains a complete implementation under which the entire test suite passes.

Copyright

Copyright © 2023, Nathan Coulter. All rights reserved.

Support

The author of this TIP requests financial support for this and other free software work. Contact and payment information available at:

https://wiki.tcl-lang.org/page/Poor+Yorick