TIP 654: Profiles for character encoding/decoding behaviour

Author:		Nathan Coulter <[email protected]>
State:		Draft
Type:		Project
Vote:		Pending
Created:	10-Jan-2023
Tcl-Version:	8.7
Tcl-branch:
Vote-Summary:	
Votes-For:	
Votes-Against:	
Votes-Present:	

Abstract

Previous attempts to articulate the options for handling non-conforming data for a character set encoding have resulted in a set of available options that are not well defined. This TIP articulates the fundamental optional behaviours and proposes a new set of names for them.

Definitions

Non-conforming encoding representation

One or more bytes that do not conform to the specification for the representation of code points in the encoding.

Non-conforming code point

One or more bytes that conform to the specification for the representation of code points in the encoding, but whose represented code points do not conform to the rules for the encoding.

Non-conforming data

Both non-conforming encoding representations and non-conforming code points.
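The distinction can be illustrated with a small sketch that uses Python's UTF-8 codec as a stand-in; this is an analogy chosen for illustration and is not part of the proposal or of Tcl. The byte 0xFF can never occur in UTF-8, so it is a non-conforming encoding representation. The sequence ED A0 80 follows the three-byte UTF-8 bit pattern but represents the surrogate U+D800, which UTF-8 does not permit, so it is a non-conforming code point. The helper name `decodes` is hypothetical.

```python
# Illustration only: Python's UTF-8 codec, not Tcl's implementation.

def decodes(data: bytes, errors: str) -> bool:
    """Return True if `data` decodes as UTF-8 under the given error handler."""
    try:
        data.decode("utf-8", errors)
        return True
    except UnicodeDecodeError:
        return False

bad_representation = b"\xff"       # no UTF-8 sequence may contain 0xFF
bad_code_point = b"\xed\xa0\x80"   # valid bit pattern, but encodes U+D800

# Both are non-conforming data, so a strict decoder rejects both.
strict_results = (decodes(bad_representation, "strict"),
                  decodes(bad_code_point, "strict"))

# Python's "surrogatepass" handler relaxes only the surrogate code-point
# rule, so it accepts the second case but still rejects the first.
relaxed_results = (decodes(bad_representation, "surrogatepass"),
                   decodes(bad_code_point, "surrogatepass"))
```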

Specification

-nocomplain is no longer an option.

Each profile is independent of the others, and activating one profile cancels any previously active profile. The profiles are identified below by the options that activate them. For each option there is a corresponding channel configuration option prefixed with the word "encoding". The profiles are:

-discard and -encodingdiscard

Not strict. Discards non-conforming data by omitting them from the output.
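As a rough analogy, Python's `ignore` error handler behaves in a similar way for decoding. The sketch below is not Tcl code and only assumes that -discard would drop the offending bytes in the same manner.

```python
# Illustration only: Python's "ignore" handler as an analogue of -discard.
data = b"caf\xff\xc3\xa9"   # 0xFF is invalid in UTF-8; C3 A9 is a valid "é"

discarded = data.decode("utf-8", errors="ignore")

# The invalid byte leaves no trace in the output, so the original bytes
# cannot be recovered from the decoded string.
reencoded = discarded.encode("utf-8")
```

Because the non-conforming byte is dropped silently, `reencoded` is shorter than `data`; this profile is lossy by design.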

-surrogate and -encodingsurrogate

Not strict. Each byte of non-conforming data is transformed into a single low surrogate code point that can be transformed back to the original byte, as described in Unicode Security Considerations. This accomplishes the same purpose as -tag, but requires only one character per byte instead of two.
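Python's `surrogateescape` error handler (PEP 383) is one existing scheme of this kind: it maps each undecodable byte 0xNN to the low surrogate U+DCNN. The sketch below uses it purely as an illustration; whether -surrogate would use the same mapping is an assumption, since this document does not fix the exact code points.

```python
# Illustration only: Python's "surrogateescape" as an analogue of -surrogate.
data = b"ab\xff\x80cd"   # 0xFF and a stray continuation byte 0x80

# Each non-conforming byte becomes exactly one low surrogate code point.
text = data.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogates back into the
# original bytes, so the transformation is reversible.
round_trip = text.encode("utf-8", errors="surrogateescape")
```

The decoded string has the same number of characters as the input has bytes here, and `round_trip` equals `data`.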

-tag and -encodingtag

Not strict. Like -pass, but tags each non-conforming byte by prefixing it with a replacement character, which is normally the standard replacement character for the encoding. Each occurrence of the replacement character itself is also prefixed with the replacement character.
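One possible reading of this profile is sketched below in Python, for UTF-8 only. The function names `decode_tag` and `encode_tag` are hypothetical, U+FFFD is assumed to be the replacement character, and the use of Python's `surrogateescape` handler to locate the non-conforming bytes is an implementation shortcut, not something the proposal prescribes.

```python
# Sketch under stated assumptions; not Tcl's implementation.
REPLACEMENT = "\ufffd"   # assumed replacement character for UTF-8

def decode_tag(data: bytes) -> str:
    """Decode UTF-8, tagging each non-conforming byte with REPLACEMENT."""
    out = []
    # surrogateescape marks each bad byte 0xNN as U+DCNN so it can be found.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(REPLACEMENT + chr(cp - 0xDC00))  # tag + byte as char
        elif ch == REPLACEMENT:
            out.append(REPLACEMENT + REPLACEMENT)       # escape the tag itself
        else:
            out.append(ch)
    return "".join(out)

def encode_tag(text: str) -> bytes:
    """Inverse of decode_tag for strings that decode_tag produced."""
    out = bytearray()
    chars = iter(text)
    for ch in chars:
        if ch != REPLACEMENT:
            out += ch.encode("utf-8")
            continue
        tagged = next(chars, None)
        if tagged is None:
            raise ValueError("dangling tag at end of input")
        if tagged == REPLACEMENT:
            out += REPLACEMENT.encode("utf-8")  # a genuine U+FFFD
        else:
            out.append(ord(tagged))             # the original raw byte
    return bytes(out)

sample = b"a\xffb" + REPLACEMENT.encode("utf-8")
tagged_text = decode_tag(sample)
```

Doubling the replacement character is what keeps the scheme unambiguous: a tag is always followed either by another replacement character or by the character standing for a raw byte.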

-pass and -encodingpass

Not strict. Each non-conforming byte becomes the character having the code-point represented by that single byte.
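A minimal sketch of this behaviour, written as a custom Python decoding error handler. The handler name `tip654-pass` is invented for this example, and Python stands in for Tcl here only as an illustration.

```python
# Sketch under stated assumptions; not Tcl's implementation.
import codecs

def _pass_handler(exc):
    """Map each undecodable byte to the character with the same code point."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]          # the non-conforming bytes
    return "".join(chr(b) for b in bad), exc.end  # resume after them

# "tip654-pass" is a hypothetical handler name chosen for this example.
codecs.register_error("tip654-pass", _pass_handler)

passed = b"a\xffb\xe9".decode("utf-8", errors="tip654-pass")
genuine = "a\u00ffb\u00e9".encode("utf-8").decode("utf-8")
```

Note that `passed` and `genuine` are the same string even though they come from different byte sequences, so the output alone does not reveal which characters originated from non-conforming bytes.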

-replace and -encodingreplace

Not strict. Each sequence of non-conforming data up to the next character boundary is replaced with a single replacement character for the encoding.
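Python's `replace` error handler gives a feel for this profile when decoding UTF-8, where the replacement character is U+FFFD. It is shown only as an analogy; how -replace would delimit a sequence in every edge case is not specified here and may differ from Python's rules.

```python
# Illustration only: Python's "replace" handler as an analogue of -replace.
# 0xFF is a single invalid byte; E2 82 is a truncated three-byte sequence
# that is cut short by the ASCII character "c".
data = b"a\xffb\xe2\x82c"

replaced = data.decode("utf-8", errors="replace")

# Count how many replacement characters were produced.
replacement_count = replaced.count("\ufffd")
```

The two-byte truncated sequence yields one replacement character rather than two, which matches the "single replacement character per sequence" wording above.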

-report

Not strict. Like -pass, but -result errors in the return options is a dictionary in which each key is the starting index, in the result, of a non-conforming substring and each value is the corresponding ending index.
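The shape of such a result can be sketched in Python as follows. The function name `decode_report` is hypothetical, and two points are assumptions rather than statements of the proposal: ending indices are treated as inclusive, and adjacent non-conforming bytes are merged into one substring.

```python
# Sketch under stated assumptions; not Tcl's implementation.

def decode_report(data: bytes):
    """Decode UTF-8 like a pass-through profile and report bad ranges.

    Returns (text, errors), where errors maps the start index of each
    non-conforming run in `text` to its inclusive end index.
    """
    chars = []
    errors = {}
    run_start = None
    # surrogateescape yields one marker character per bad byte, so the
    # indices seen here are also indices into the final result.
    for i, ch in enumerate(data.decode("utf-8", errors="surrogateescape")):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            chars.append(chr(cp - 0xDC00))  # byte becomes same-valued char
            if run_start is None:
                run_start = i
            errors[run_start] = i           # extend the current run
        else:
            chars.append(ch)
            run_start = None
    return "".join(chars), errors

text, errors = decode_report(b"ab\xff\xfecd\x80")
```

A caller could then use the reported ranges to highlight or post-process the affected substrings without the decoding itself having failed.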

-strict and -encodingstrict

Strictly conform to the specification for the encoding. It is an error for non-conforming data to occur.
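For comparison, Python's default `strict` error handler behaves in this spirit: decoding stops with an error at the first non-conforming byte. The sketch is an analogy only and does not show how Tcl would report the error.

```python
# Illustration only: Python's "strict" handler as an analogue of -strict.

def decode_strict(data: bytes) -> str:
    """Decode UTF-8, raising UnicodeDecodeError on non-conforming data."""
    return data.decode("utf-8", errors="strict")

try:
    decode_strict(b"a\xffb")
    outcome = "decoded"
except UnicodeDecodeError as exc:
    # exc.start is the byte offset of the first non-conforming byte.
    outcome = f"error at byte {exc.start}"

conforming = decode_strict(b"caf\xc3\xa9")
```

Conforming input decodes normally; only the presence of non-conforming data turns the operation into an error.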

Rationale

-strict was introduced in TIP 346, which focused narrowly on issues surrounding byte arrays and non-mappable characters. It should instead have focused on conformance to the chosen encoding, which is more fundamental. -nocomplain was subsequently introduced in TIP 601, which did not describe its relationship to -strict. It turned out to be nothing more than the inverse of -strict, and has already been eliminated in the implementation of Tcl's internal encoding/decoding functions.

Implementation

On branch [trunk-encodingdefaultstrict](https://core.tcl-lang.org/tcl/timeline?r=trunk-encodingdefaultstrict), the internal TCL_ENCODING_NOCOMPLAIN has been eliminated, -nocomplain is eliminated, and -encodingstrict is available. The additional described options and behaviour will subsequently be implemented.

Copyright

This document has been placed in the public domain.