TIP 657: Make "-profile strict" the default in Tcl 9.0

Author:         Jan Nijtmans <[email protected]>
State:          Draft
Type:           Project
Vote:           Pending
Tcl-Version:    9.0
Tcl-Branch:     tip-657

Abstract

This TIP proposes to make "-profile strict" the default. This was previously proposed in TIP #601 (with a different approach), but the implementation did not match the TIP text. This TIP is intended as a replacement for TIP #601, and it builds on top of TIP #656 ("A revised proposal for encodings").

An important part missing from TIP #601 is the Compatibility section, which should have been much clearer about the implications of the change.

Rationale

The tcl8 profile is a legacy profile that does not conform to any recommended behavior; the two other profiles, strict and replace, do.
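The difference between the three profiles can be observed at the script level with "encoding convertfrom". The following is a minimal sketch, assuming a Tcl build in which the "-profile" option from TIP #656 is available; the sample byte string is made up for illustration.

```tcl
# Assumption: Tcl 9.0 (or an 8.7 build that supports -profile).
# The byte 0xC3 followed by "|" is not a valid UTF-8 sequence.
set bytes "A\xC3|"

# strict: the conversion raises an error instead of returning data.
set failed [catch {encoding convertfrom -profile strict utf-8 $bytes} msg]

# replace: the invalid byte is converted to U+FFFD (REPLACEMENT CHARACTER).
set replaced [encoding convertfrom -profile replace utf-8 $bytes]

# tcl8: legacy behavior; no error is raised for the invalid byte.
set legacy [encoding convertfrom -profile tcl8 utf-8 $bytes]
```

Here "failed" should end up true, while the other two calls return a string without raising an error.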

Since strict is the most desired profile, it becomes the default in Tcl 9.0. That has some implications at the script level and also in the C API. Many "http" testcases fail (without further measures), because they depend on the "tcl8" profile never throwing exceptions. Many scripts will have to be adapted, either expecting exceptions for encoding errors or setting the channel profile to "tcl8" or "replace". And functions like "fcopy", "read" and "gets" will throw exceptions in more situations than before.
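As an illustration of the two ways a script might adapt, the sketch below either lets the encoding error propagate or retries with the "replace" profile. The helper names readText and readTextLenient are hypothetical, and the sketch assumes that the error raised by "read" carries the error code prefix POSIX EILSEQ, as described in the Compatibility section.

```tcl
# Hypothetical helper: read a whole UTF-8 file with the given profile.
# With the strict profile, an encoding error propagates to the caller.
proc readText {path {profile strict}} {
    set ch [open $path r]
    try {
        chan configure $ch -encoding utf-8 -profile $profile
        return [read $ch]
    } finally {
        close $ch
    }
}

# Hypothetical helper: try strict first, then fall back to "replace"
# when the read fails with an encoding error (assumed POSIX EILSEQ).
proc readTextLenient {path} {
    try {
        return [readText $path strict]
    } trap {POSIX EILSEQ} {} {
        return [readText $path replace]
    }
}
```

Which of the two is appropriate depends on whether silently substituted characters are acceptable for the data being read.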

Specification

Passing the TCL_ENCODING_STOPONERROR flag to Tcl_ExternalToUtfDStringEx(), Tcl_UtfToExternalDStringEx(), Tcl_ExternalToUtf() and Tcl_UtfToExternal() causes these functions to report any encoding error that occurs (represented in the C API as the EILSEQ POSIX error). In Tcl 9.0, the behaviour indicated by the flag TCL_ENCODING_STOPONERROR becomes the default, and the flags TCL_ENCODING_PROFILE_TCL8 and TCL_ENCODING_PROFILE_REPLACE both prevent any exceptions from being thrown.

A new function, Tcl_InputEncodingError(), may be used instead of checking for the EILSEQ POSIX error. If Tcl_InputEncodingError() returns 1, then the current position of the channel is the position where an encoding error occurred, and any follow-up 'read' (or 'gets') would return no data but result in an EILSEQ POSIX error. This condition can be reset by changing the '-encoding' or the '-profile' of the channel.

Backporting to Tcl 8.7

The function Tcl_InputEncodingError() will be backported to Tcl 8.7. It's only useful if the channel is set to "-profile strict".

Also, in Tcl 8.7, the "-failindex" option will be changed to work the same as in Tcl 9.0: If "-failindex" is specified, but "-profile" is not specified in the "encoding convertfrom/convertto" command, then the "strict" profile will be assumed.
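A short sketch of the "-failindex" behavior described above, assuming a Tcl build that implements it; the sample data and the variable names pos and ok are illustrative only.

```tcl
# Assumption: Tcl 9.0, or Tcl 8.7 with the change described above.
# No -profile is given, so "strict" is assumed because -failindex is used.
set bytes "A\xC3|"

# Instead of raising an error, the command returns the data converted
# so far and stores the index of the first offending byte in "pos".
set prefix [encoding convertfrom -failindex pos utf-8 $bytes]

# When the whole input converts cleanly, the index variable is set to -1.
set all [encoding convertfrom -failindex ok utf-8 "ABC"]
```

In this sketch "prefix" holds the single character "A" and "pos" points at the invalid byte, so a caller can decide how to treat the remainder of the input.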

Compatibility

This is an incompatible change for Tcl_ExternalToUtf()/Tcl_UtfToExternal(). But since those functions are rarely used (and when they are used, they often have the TCL_ENCODING_STOPONERROR flag set already), it will have little effect.

This is also an incompatible change for Tcl_Read(), Tcl_Write() and Tcl_Gets(). For channels which have the (default) strict profile, they can now return a POSIX error EILSEQ when an encoding error occurs. For maximum compatibility with current behavior, a distinction is made between 'blocking' and 'non-blocking' mode.

In 'blocking' mode, the functions Tcl_Read()/Tcl_ReadObj() and Tcl_Gets()/Tcl_GetsObj() set the POSIX error EILSEQ whenever an encoding error occurs. They also return the data as received so far, and the file pointer will be left where the encoding error occurred. If there was left-over data, received before encountering the encoding error, this data will be left in the "-data" return option.
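At the script level, the blocking-mode behavior might be observed as in the sketch below. It assumes Tcl 9.0 semantics as proposed here; the temporary file and its contents are made up for illustration.

```tcl
# Create a small file containing an invalid UTF-8 sequence (0xC3 "|").
set ch [file tempfile path]
chan configure $ch -translation binary
puts -nonewline $ch "A\xC3|"
close $ch

# Read it back in blocking mode with the strict profile.
set ch [open $path r]
chan configure $ch -encoding utf-8 -profile strict
set code [catch {read $ch} msg opts]

# The data decoded before the error is expected in the "-data" return option.
set partial ""
if {[dict exists $opts -data]} {
    set partial [dict get $opts -data]
}
close $ch
file delete $path
```

After the failed read, the script can change the '-encoding' or '-profile' of the channel and continue reading from the position of the error, rather than closing the channel as this sketch does.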

In 'non-blocking' mode, if any data is returned before the encoding error, the POSIX error will not be set yet, so the channel has a chance to handle the data received so far normally. The next call to Tcl_Read()/Tcl_ReadObj()/Tcl_Gets()/Tcl_GetsObj() (which normally happens in a loop or as a readable event) will return no data but only the POSIX error EILSEQ.

The functions Tcl_Write()/Tcl_WriteObj() and Tcl_Eof() don't depend on blocking mode. Tcl_Write() will always write out as many characters as it can, and always sets POSIX error EILSEQ when it cannot write more due to an encoding error. Tcl_Eof() will only return true when the channel is at an EOF condition; it will return false when the channel is at an encoding error position.

The 'http' package is modified because of this change: Since the 'http' package is not prepared to handle exceptions, it can easily be left in an inconsistent state, as shown by test-case errors when the default profile was changed to 'strict'. Therefore, the 'http' package, when run in Tcl 9.0, will use the 'replace' profile. This makes the package conformant to the W3C recommendations.

The 'tcltest' package is modified to use the 'tcl8' profile for its internal channels. For this package, we don't want exceptions to disturb test output. If a test case wants to handle a surrogate, so be it; this should not disturb the test case.

Copyright

This document has been placed in the public domain.