TIP 656: A revised proposal for encodings

Author:		Ashok P. Nadkarni <[email protected]>
State:		Final
Type:		Project
Vote:		Done
Created:	2023-02-13
Tcl-Version:	8.7
Tcl-branch:	tip-656
Vote-Summary:	Accepted 6/0/0
Votes-For:	AK, JD, JN, KW, MC, SL
Votes-Against:	none
Votes-Present:	none

Abstract

This TIP proposes enhancements to the character encoding commands and public C APIs present in Tcl 8.6, based on the profile concept of TIP 654. It differs from that TIP in terms of syntax, C API and semantics related to other options. It supplants previously accepted TIPs 346, 601 and 633 that targeted the same functionality.

The TIP also defines fconfigure options to associate profiles with channels to control their encoding behavior.

Rationale

Operations involving encoding transforms may encounter multiple types of errors, such as invalid sequences in the source or characters that cannot be encoded in the target encoding. Tcl 8.6 dealt with these errors by "wishing them away", either by substituting a ? character or (effectively) assuming the offending byte was in CP1252 encoding.

TIP's 346, 601 and 607 proposed options to the encoding command that allowed applications to detect and handle encoding transform errors. Further, 633 added corresponding options to fconfigure for the same purpose.

There are however inadequacies in these options as described in a separate write-up and mailing list and summarized below.

This proposal, based on TIP 654's profile model, is intended to address the above shortcomings.

Specification

Profiles

The following types of errors may be encountered when converting an encoded byte sequence into Tcl's internal form:

Similarly, the following types of errors may be encountered when converting in the other direction:

A profile defines the handling of each of the above error cases by either

  1. Terminating further processing of the source data. The profile does not determine how this premature termination is conveyed to the caller. By default, this is signalled by raising an exception. The -failindex option as described in TIP 607 may be used instead.

  2. Using a fallback strategy for the offending bytes and continuing to process the rest of the data. The fallback may be the use of a replacement character (either fixed or dependent on the invalid byte), discarding the invalid bytes, etc., as defined by the profile.

Note that none of the currently defined profiles distinguishes between error cases, but nothing prevents a profile defined in the future from doing so. For example, an allowsurrogates profile may pass through surrogate code points (illegal in UTF-8) but stop processing on other error cases.

This TIP defines three profiles, tcl8, strict and replace.

The tcl8 profile

The tcl8 profile corresponds to the implementation of encoders in Tcl 8.6.

When converting to Tcl's string form, with the exception of the special case noted below, each byte of an illegal byte sequence is mapped to its numerically equivalent code point. In effect, it treats the byte as being in ISO8859-1 encoding even though the transform may have specified a different encoding.

As a special case, for the UTF-8 encoding the illegal sequence \xC0\x80 is mapped to U+000000.

When converting a Tcl string to an encoded byte sequence, values that cannot be encoded in the target encoding are mapped to an encoding-specific fallback character, usually ?. For UTF encodings, this case cannot arise as they can represent all code points. Additionally, for the error case where the code point being encoded is prohibited from appearing in encoded form (surrogates for example), the tcl8 profile ignores the mandate and encodes the code point anyway.

The tcl8 profile is not conformant with the Unicode standard. Moreover, it leaves room for silent misinterpretation of data.

With respect to the current implementation, the tcl8 profile replaces the -nocomplain option of TIP 601.

The strict profile

The strict profile implements strictly conformant behavior as defined in the Unicode standard. All error cases result in the error being signalled.

With respect to the current implementation, the strict profile replaces the -strict option of TIP 346.

The replace profile

The replace profile implements an alternate conformant behaviour defined in the Unicode standard.

When converting an encoded byte sequence to a Tcl string, invalid byte sequences are replaced by the U+FFFD REPLACEMENT CHARACTER code point.

When encoding a Tcl string, characters that cannot be represented in the target encoding are transformed to an encoding-specific fallback character, U+FFFD REPLACEMENT CHARACTER for UTF targets and generally ? for other encodings.

When multiple successive invalid bytes are encountered, the Unicode standard allows for their substitution with a single or multiple replacement characters. The replace profile conforms to this.
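The encoding direction described above can be sketched as follows. This is a minimal illustration, assuming a Tcl build that implements this TIP and using the -profile option specified later in this document; the sample string and the variable names are chosen purely for illustration.

```tcl
# "caf" followed by U+00E9, which the ascii encoding cannot represent.
set text "caf\u00E9"

# replace profile: the unencodable character becomes the fallback
# character of the target encoding (? for ascii).
set replaced [encoding convertto -profile replace ascii $text]

# strict profile: the same input stops processing and, without
# -failindex, raises an error.
set strictFailed [catch {encoding convertto -profile strict ascii $text} msg]

puts "replace: $replaced"        ;# replace: caf?
puts "strict failed: $strictFailed"  ;# strict failed: 1
```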

There is no equivalent to the replace profile in the current TIP 346/601 based 8.7 implementation.

The default profile

This TIP does not specify the default profile to be used. That is the subject of a separate TIP.

The encoding profiles command

A new command is added that will return the names of implemented profiles.

encoding profiles
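One possible use of this command is validating a profile name received from a user or a configuration file before passing it on. The sketch below assumes a Tcl build that implements this TIP; checkProfile is a hypothetical helper name, not part of the proposal.

```tcl
# Hypothetical helper: return the profile name if the running
# interpreter implements it, otherwise raise an error.
proc checkProfile {name} {
    if {$name ni [encoding profiles]} {
        return -code error "unknown encoding profile \"$name\""
    }
    return $name
}

# The three profiles defined by this TIP are expected to be present.
foreach name {tcl8 strict replace} {
    puts "[checkProfile $name] is available"
}
```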

Changes to encoding convertfrom and encoding convertto

The commands encoding convertfrom and encoding convertto support a new option -profile that takes a profile name as value. The -strict and -nocomplain options are no longer supported. The commands take the form

encoding convertfrom DATA
encoding convertfrom ?-profile PROFILE? ?-failindex VAR? ENCODING DATA

encoding convertto DATA
encoding convertto ?-profile PROFILE? ?-failindex VAR? ENCODING DATA

The syntax is backward compatible with 8.6. However, it differs from the current 8.7/9.0 implementation in that there is no ambiguity. In the current implementation, when two arguments are supplied, it tries to guess whether the first is an option or an encoding name. With the above syntax, if any options are specified, the encoding must be explicitly specified as well. Note it is possible to relax this based on odd/evenness of the argument count but that would make it trickier to add options in the future that do not take an argument.

The -profile option specifies the profile to be used for the conversion as described earlier. If multiple -profile options are passed, the last one will be used.
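As an illustration of the syntax, the following sketch decodes the same invalid UTF-8 input under each of the three profiles. It assumes a Tcl build that implements this TIP; the input bytes and variable names are arbitrary examples.

```tcl
# Bytes "A", 0xFF, "Z". The byte 0xFF can never appear in valid UTF-8.
set bytes [binary decode hex 41FF5A]

# tcl8: the invalid byte is mapped to the numerically equal code point.
set viaTcl8 [encoding convertfrom -profile tcl8 utf-8 $bytes]

# replace: the invalid byte becomes U+FFFD REPLACEMENT CHARACTER.
set viaReplace [encoding convertfrom -profile replace utf-8 $bytes]

# strict: processing stops and, without -failindex, an error is raised.
set strictFailed [catch {
    encoding convertfrom -profile strict utf-8 $bytes
} msg]

puts [expr {$viaTcl8 eq "A\u00FFZ"}]     ;# 1
puts [expr {$viaReplace eq "A\uFFFDZ"}]  ;# 1
puts $strictFailed                       ;# 1
```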

The -failindex option behaves as defined in TIP 607. However, although not specified in that TIP, in the current 8.7/9.0 implementation the -failindex option also enables the -strict option. This TIP specifically proposes that the option not make any implicit selection of profiles. In addition to the author's opinion that options should be as orthogonal to each other as possible, the current implied behavior makes it awkward to write (for example) a proc that takes a profile and returns as much data as can be read without raising an error. The -failindex option now only determines whether an exception is raised or decoded data is returned with error location in the passed variable when processing of the input data is stopped as determined by the profile.
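The kind of proc mentioned above, one that takes a profile and returns as much data as can be decoded without raising an error, might look like the following. This is a sketch assuming a Tcl build that implements this TIP; decodePrefix is a hypothetical name.

```tcl
# Hypothetical helper: decode as much of $bytes as the given profile
# permits. Returns a two-element list: the decoded text and the index
# at which processing stopped (-1 if the whole input was consumed).
proc decodePrefix {profile encoding bytes} {
    set text [encoding convertfrom -profile $profile \
                  -failindex pos $encoding $bytes]
    return [list $text $pos]
}

# Bytes "AB", 0xFF, "CD".
set bytes [binary decode hex 4142FF4344]

# strict stops at the invalid byte; the prefix and its index are returned.
lassign [decodePrefix strict utf-8 $bytes] text pos
puts "$text $pos"   ;# AB 2

# replace never stops, so -failindex reports -1.
lassign [decodePrefix replace utf-8 $bytes] text pos
puts $pos           ;# -1
```

Because -failindex does not select a profile implicitly, the same proc works unchanged for any profile name.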

New option -profile for fconfigure and chan configure

A new option -profile has been added to the fconfigure command. The option's value must be a profile name. The encoding transforms in use for the channel's input and output will then be subject to the rules of that profile. Any failures will result in a channel error. The mode of reporting channel error is a function of the channel subsystem and not defined by this TIP.
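A channel-level use of the option could look like the sketch below, which writes invalid UTF-8 to a temporary file and reads it back under the replace profile. It assumes a Tcl build that implements this TIP under the final -profile option name; the file contents and variable names are illustrative only.

```tcl
# Write the raw bytes "A", 0xFF, "Z" to a temporary file.
set fd [file tempfile path]
fconfigure $fd -translation binary
puts -nonewline $fd [binary decode hex 41FF5A]
close $fd

# Read the file back as UTF-8, substituting U+FFFD for invalid bytes.
set fd [open $path r]
fconfigure $fd -encoding utf-8 -profile replace
set configured [fconfigure $fd -profile]
set text [read $fd]
close $fd
file delete $path

puts $configured                    ;# replace
puts [expr {$text eq "A\uFFFDZ"}]   ;# 1
```

With -profile strict, the read would instead fail with a channel error, reported in whatever manner the channel subsystem defines.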

The -strictencoding and -nocomplainencoding options that were defined by the earlier TIP's and currently implemented in 8.7 and 9.0 alphas are supplanted by -profile and removed.

Changes to the C API's

Two new functions, Tcl_ExternalToUtfDStringEx and Tcl_UtfToExternalDStringEx, related to encoding transforms were added by TIP 601 for 8.7. These had the signatures

Tcl_Size Tcl_ExternalToUtfDStringEx(Tcl_Encoding encoding, const char *src, int srcLen, int flags, Tcl_DString *dsPtr);
Tcl_Size Tcl_UtfToExternalDStringEx(Tcl_Encoding encoding, const char *src, int srcLen, int flags, Tcl_DString *dsPtr);

This TIP changes the signatures of these functions to the following:

int
Tcl_ExternalToUtfDStringEx(
    Tcl_Interp *interp,     /* For error messages. May be NULL. */
    Tcl_Encoding encoding,  /* The encoding for the source string, or NULL
                             * for the default system encoding. */
    const char *src,        /* Source string in specified encoding. */
    Tcl_Size srcLen,        /* Source string length in bytes, or < 0 for
                             * encoding-specific string length. */
    int flags,              /* Conversion control flags. */
    Tcl_DString *dstPtr,    /* Uninitialized or free DString in which the
                             * converted string is stored. Must be freed
                             * after return irrespective of return value */
    Tcl_Size *errorLocPtr); /* Where to store the error location
                               (or TCL_INDEX_NONE if no error). May be NULL. */
int
Tcl_UtfToExternalDStringEx(
    Tcl_Interp *interp,     /* For error messages. May be NULL. */
    Tcl_Encoding encoding,  /* The encoding for the converted string, or
                             * NULL for the default system encoding. */
    const char *src,        /* Source string in UTF-8. */
    Tcl_Size srcLen,        /* Source string length in bytes, or < 0 for
                             * strlen(). */
    int flags,              /* Conversion control flags. */
    Tcl_DString *dstPtr,    /* Uninitialized or free DString in which the
                             * converted string is stored. Must be freed
                             * after return irrespective of return value */
    Tcl_Size *errorLocPtr); /* Where to store the error location
                              (or TCL_INDEX_NONE if no error). May be NULL. */

The Tcl_ExternalToUtfDStringEx function converts a source buffer from the specified encoding into UTF-8. The Tcl_UtfToExternalDStringEx function does the converse, converting a source buffer from UTF-8 to the specified encoding.

The flags parameter may be composed from OR-ing the following values:

For preserving future compatibility, any other bits will result in an error being returned. In particular, callers should not set the TCL_ENCODING_START and TCL_ENCODING_STOP flags as those are not supported by the above functions (even in the current pre-profile implementation) as they do not implement streaming operation.

Both functions have the same set of return values:

In the case of the TCL_CONVERT_* return codes,

Differences from the current 8.7 API

As stated above, the signatures of the functions differ from the currently implemented 8.7 and 9.0 API's. The new signatures are motivated by:

In addition to the change in signatures, the TCL_ENCODING_NOCOMPLAIN, TCL_ENCODING_STRICT and TCL_ENCODING_MODIFIED bit flags have been removed. These were not present in Tcl 8.6 so there is no backward compatibility issue.

The first two have been supplanted by the profile related flags. The TCL_ENCODING_MODIFIED bit was intended to be used to specify a variant of UTF-8 or CESU-8 for encoding nul bytes as \xC0\x80. This is never set internally within the Tcl core and not accessible at the script level either. The motivation for eliminating it arises from the belief that this is actually an encoding and best modeled as such instead of through flags. If encoding variants are enabled through flags, then why not CESU-8 as a variant of UTF-8, or UTF-16LE/UTF-16BE as variants of UTF-16? As an aside, other languages also treat this "modified" UTF-8 as a separate selectable encoding. A separate encoding would also make it usable from the script level if so desired without changing the API.

Implementation

Implementation and tests for Tcl 8.7 and 9.0 are available in the tip-656 and tip-656-tcl9 branches respectively. Currently these still use the -encodingprofile option name and will be changed to -profile dependent on TIP approval. Manpages have not been updated.

Alternative proposals

There have been a couple of alternatives proposed on the mailing list.

Finer granularity of error class selection

The first is an -onerror option which is similar to the -profile option but allows for finer granularity.

encoding convertfrom -onerror {surrogates invalid wrongcode} ....
encoding convertfrom -handle {SURROGATE error INVALID replace INCOMPLETE ignore ...}

The author is not in favor of this as I expect it to add considerable complexity to implementation and test suites while being minimally useful to the end user. (It feels like over-generalization to me. How often would a user want to distinguish between invalid cases?)

Include the profile within the encoding parameter

Another syntactic alternative proposed was to embed the error handling options into the encoding argument.

encoding convertfrom {utf-8 strict}
fconfigure CHANNEL -encoding {utf-8 strict}

Since the difference is primarily in command option processing, implementation changes are not many. I prefer the first form from a stylistic perspective. For example, the latter makes it a little more awkward to request a profile using the default encoding.

Alternative fconfigure option name

The original option to fconfigure proposed by this TIP was -encodingprofile. That has been renamed to -profile as per Jan's suggestion. This is both less wordy and does not conflict with -encoding when used in shorter forms.

Copyright

This document has been placed in the public domain.