TIP 654: Profiles for character encoding/decoding behaviour

Author:		Nathan Coulter <[email protected]>
State:		Draft
Type:		Project
Vote:		Pending
Created:	10-Jan-2023
Tcl-Version:	8.7
Obsoleted-By:   656

Abstract

Previous attempts to articulate the options for handling non-conforming data for a character set encoding have resulted in a set of available options that are not well defined. This TIP articulates the fundamental optional behaviours and proposes a new set of names for them.

Definitions

Non-conforming representation

One or more bytes that do not conform to the specification for the representation of code points in the encoding.

Non-conforming code point

One or more bytes that conform to the specification for the representation of code points in the encoding, but whose represented code points do not conform to the rules for the encoding.

Non-conforming data

Non-conforming representations and non-conforming code points, taken together.
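To make the distinction between the two definitions concrete, here is a minimal sketch that classifies a single UTF-8 byte sequence. The procedure name classifyUtf8Sequence and its return values are hypothetical and are not part of Tcl or of this proposal. Treating overlong forms as non-conforming representations is an assumption, since the definitions above do not address them.

```tcl
# Hypothetical helper, not part of Tcl or of this TIP.
# Returns "representation", "codepoint", or "conforming" for one UTF-8 sequence.
proc classifyUtf8Sequence {bytes} {
    set b {}
    binary scan $bytes cu* b
    set lead [lindex $b 0]
    # Derive the expected length, payload bits, and smallest legal code point
    # from the lead byte.
    if {$lead eq ""} {
        return representation
    } elseif {$lead < 0x80} {
        set need 1; set min 0; set cp $lead
    } elseif {($lead & 0xE0) == 0xC0} {
        set need 2; set min 0x80; set cp [expr {$lead & 0x1F}]
    } elseif {($lead & 0xF0) == 0xE0} {
        set need 3; set min 0x800; set cp [expr {$lead & 0x0F}]
    } elseif {($lead & 0xF8) == 0xF0} {
        set need 4; set min 0x10000; set cp [expr {$lead & 0x07}]
    } else {
        # Stray continuation byte, or a byte in the range 0xF8..0xFF.
        return representation
    }
    if {[llength $b] != $need} {
        return representation
    }
    foreach c [lrange $b 1 end] {
        if {($c & 0xC0) != 0x80} {
            return representation
        }
        set cp [expr {($cp << 6) | ($c & 0x3F)}]
    }
    # Assumption: an overlong form is counted as a bad representation.
    if {$cp < $min} {
        return representation
    }
    # The bit pattern is acceptable, so apply the rules for the code point.
    if {($cp >= 0xD800 && $cp <= 0xDFFF) || $cp > 0x10FFFF} {
        return codepoint
    }
    return conforming
}

puts [classifyUtf8Sequence [binary format c* {0xC3 0xA9}]]
puts [classifyUtf8Sequence [binary format c* {0xFF}]]
puts [classifyUtf8Sequence [binary format c* {0xED 0xA0 0x80}]]
```

Under these assumptions the three calls print conforming, representation, and codepoint: the last sequence follows the three-byte bit pattern but yields the surrogate U+D800, which UTF-8 does not permit.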

Specification

-nocomplain is no longer an option.

The "encoding" value of encoding convertfrom, encoding convertto, and chan configure -encoding is a dictionary (or at least conceptually one). The first key in the dictionary may be omitted, in which case it is taken to be "name". The "name" key provides the name of the encoding. Examples:

    chan configure $chan -encoding utf-8
    chan configure $chan -encoding {name utf-8}
    chan configure $chan -encoding {utf-8 profile strict}
    chan configure $chan -encoding {name utf-8 profile strict}

chan configure $chan -encoding returns the name of the encoding for the channel, as it always has.

chan configure $chan -encoding* returns a dictionary describing the configuration of the encoder for the channel.
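The optional leading "name" key described above can be illustrated with a small normalisation routine. This is only a sketch: normalizeEncodingSpec is a hypothetical helper, not a command proposed by this TIP, and it assumes that a value with an odd number of elements is one whose "name" key was omitted.

```tcl
# Hypothetical helper, not part of Tcl or of this TIP: expand an encoding
# value into a dictionary that always carries an explicit "name" key.
proc normalizeEncodingSpec {spec} {
    # Assumption: an odd element count means the leading "name" key was omitted.
    if {[llength $spec] % 2 == 1} {
        set spec [linsert $spec 0 name]
    }
    if {![dict exists $spec name]} {
        error "encoding value has no name: $spec"
    }
    return $spec
}

puts [normalizeEncodingSpec utf-8]
puts [normalizeEncodingSpec {utf-8 profile strict}]
puts [normalizeEncodingSpec {name utf-8 profile strict}]
```

The last two calls produce the same dictionary, which matches the equivalence of the third and fourth chan configure examples.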

The "profile" key provides the encoding profile. Each profile is independent of the others, and activating one profile cancels any previously active profile. The profiles are identified below by the options that activate them. For each option there is a corresponding channel configuration option prefixed with the word "encoding".

The profiles are:

discard

Not strict. Discards non-conforming data by omitting them from the output.

surrogate

Not strict. Each byte of non-conforming data is transformed into a single low surrogate code point that can be transformed back to the original byte, as described in Unicode Security Considerations. This accomplishes the same purpose as tag, but requires only one character per byte instead of two.

pass

Not strict. Each non-conforming byte becomes the character having the Unicode code-point represented by that single byte.

replace

Not strict. When converting an encoded byte sequence to a Tcl string, invalid byte sequences are replaced by the U+FFFD REPLACEMENT CHARACTER code point.

When encoding a Tcl string, characters that cannot be represented in the target encoding are transformed to an encoding-specific fallback character, U+FFFD REPLACEMENT CHARACTER for UTF targets and generally ? for other encodings.

report

Not strict. Like pass, but the "errors" entry in the return options is a dictionary in which each key is the starting index, within the result, of a non-conforming substring and each value is the corresponding ending index.

strict

Strictly conform to the specification for the encoding. It is an error for non-conforming data to occur.

tag

Not strict. Like pass, but tags each non-conforming byte by prefixing it with a replacement character, which is normally the standard replacement character for the encoding. Each occurrence of the replacement character itself is also prefixed with the replacement character.

tcl8

The same as pass, but may in the future diverge if it is discovered that Tcl 8 behaviour does not mirror that described for pass.
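The decoding behaviour of several profiles can be compared side by side with a toy decoder. The sketch below is not Tcl's implementation: decodeAscii is a hypothetical procedure that decodes 7-bit ASCII, so every byte of 0x80 or above counts as non-conforming and forms a one-byte sequence of its own. The surrogate mapping of 0xDC00 plus the byte value is an assumption, as is returning the report indices in an "errors" key of the result rather than in the return options. The tag and tcl8 profiles are left out.

```tcl
# Hypothetical sketch, not Tcl's implementation. Returns a dict with two keys:
#   result  list of code points written as U+XXXX
#   errors  start-index -> end-index map, filled only by the report profile
proc decodeAscii {bytes profile} {
    set known {strict discard pass replace surrogate report}
    if {$profile ni $known} {
        error "unknown profile \"$profile\""
    }
    set values {}
    binary scan $bytes cu* values
    set result {}
    set errors [dict create]
    set runStart -1
    foreach b $values {
        if {$b < 0x80} {
            # Conforming byte: handled identically under every profile.
            lappend result [format U+%04X $b]
            set runStart -1
            continue
        }
        switch -- $profile {
            strict    { error "non-conforming byte [format 0x%02X $b]" }
            discard   { }
            pass      { lappend result [format U+%04X $b] }
            replace   { lappend result U+FFFD }
            surrogate { lappend result [format U+%04X [expr {0xDC00 + $b}]] }
            report {
                # Same output as pass, plus the extent of each bad run.
                set index [llength $result]
                lappend result [format U+%04X $b]
                if {$runStart < 0} { set runStart $index }
                dict set errors $runStart $index
            }
        }
    }
    return [dict create result $result errors $errors]
}

set input [binary format c* {0x41 0xE9 0xFF 0x42}]
foreach profile {discard pass replace surrogate report} {
    puts "$profile: [decodeAscii $input $profile]"
}
```

For this input, discard yields two code points, pass and report yield four, and report additionally records one non-conforming run spanning result indices 1 to 2. The strict profile raises an error at the first byte of that run.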

Rationale

-strict was introduced in TIP 346, which focused narrowly on issues surrounding byte arrays and non-mappable characters. It should instead have focused on conformance to the chosen encoding, which is more fundamental. -nocomplain was subsequently introduced in TIP 601, which did not describe its relationship to -strict. It turned out to be nothing more than the inverse of -strict, and it has already been eliminated in the implementation of Tcl's internal encoding/decoding functions.

Under this proposal, the syntax for specifying an encoding and its options is the same for both encoding convertto/from and chan configure -encoding, which simplifies the interface.

Implementation

Implementation is in progress.

Copyright

This document has been placed in the public domain.