Author: Nathan Coulter
State: Draft
Type: Project
Vote: Pending
Created: 10-Jan-2023
Tcl-Version: 8.7
Tcl-branch:
Vote-Summary:
Votes-For:
Votes-Against:
Votes-Present:
Abstract
Previous attempts to articulate the options for handling non-conforming data for a character set encoding have resulted in a set of available options that are not well defined. This TIP articulates the fundamental optional behaviours and proposes a new set of names for them.
Definitions
Non-conforming encoding representation
One or more bytes that do not conform to the specification for the representation of code points in the encoding.
Non-conforming code point
One or more bytes that conform to the specification for the representation of code points in the encoding, but whose represented code points do not conform to the rules for the encoding.
Non-conforming data
Both non-conforming encoding representations and non-conforming code points.
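The distinction between the two kinds of non-conforming data can be illustrated with UTF-8. The following is a minimal sketch in Python rather than Tcl, and it rests on an assumption that is not stated in this TIP: that a byte sequence following the UTF-8 bit pattern for a surrogate code point counts as a "non-conforming code point", while a byte such as 0xFF, which can never occur in UTF-8, counts as a "non-conforming encoding representation". The function name `classify` is hypothetical.

```python
def classify(data: bytes) -> str:
    """Classify a byte string against UTF-8 (illustrative only)."""
    try:
        data.decode("utf-8")
        return "conforming"
    except UnicodeDecodeError:
        pass
    try:
        # "surrogatepass" accepts the UTF-8 bit pattern for surrogates,
        # so success here means the bytes are structurally well formed
        # but represent a code point that UTF-8 does not permit.
        data.decode("utf-8", "surrogatepass")
        return "non-conforming code point"
    except UnicodeDecodeError:
        return "non-conforming encoding representation"


print(classify(b"abc"))            # plain ASCII
print(classify(b"\xff"))           # byte that never occurs in UTF-8
print(classify(b"\xed\xa0\x80"))   # bit pattern for U+D800, a surrogate
```

This sketch only recognises surrogates as the code-point case; other categories, such as overlong forms, fall into the representation case here.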
Specification
-nocomplain is no longer an option.
The profiles are independent of one another, and activating one profile cancels any previously active profile. The profiles are identified below by the options that activate them. For each option there is a corresponding channel configuration option prefixed with the word "encoding". The profiles are:
-discard and -encodingdiscard
Not strict. Discards non-conforming data by omitting it from the output.
-surrogate and -encodingsurrogate
Not strict. Each byte of non-conforming data is transformed into a single low surrogate code point that can be transformed back to the original byte, as described in Unicode Security Considerations. This accomplishes the same purpose as -tag, but requires only one character per byte instead of two.
-tag and -encodingtag
Not strict. Like -pass, but tags each non-conforming byte by prefixing it with a replacement character, which is normally the standard replacement character for the encoding. Each occurrence of the replacement character itself is also prefixed with the replacement character.
-pass and -encodingpass
Not strict. Each non-conforming byte becomes the character whose code point is the value of that single byte.
-replace and -encodingreplace
Not strict. Each sequence of non-conforming data up to the next character boundary is replaced with a single replacement character for the encoding.
-report
Not strict. Like -pass, but -result errors in the return options is a dictionary in which each key is the starting index, in the result, of a non-conforming substring and each value is the corresponding ending index.
-strict and -encodingstrict
Strictly conform to the specification for the encoding. It is an error for non-conforming data to occur.
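To make the differences between the profiles concrete, the following sketch emulates them for decoding. It is written in Python, not Tcl, and it is an analogy under stated assumptions rather than the proposed implementation: Python's built-in error handlers `strict`, `ignore`, `replace` and `surrogateescape` are assumed to correspond roughly to -strict, -discard, -replace and -surrogate, and the helpers `segments`, `decode_pass`, `decode_tag` and `decode_report` are hypothetical names introduced here. The ending index reported by `decode_report` is assumed to be inclusive, which this TIP does not specify.

```python
REPLACEMENT = "\ufffd"  # standard Unicode replacement character

# Assumed correspondence between profiles and Python's built-in handlers.
BUILTIN = {"strict": "strict", "discard": "ignore",
           "replace": "replace", "surrogate": "surrogateescape"}


def segments(data, encoding="utf-8"):
    """Yield (text, ok) pairs: decoded runs (ok=True) and runs of
    non-conforming bytes mapped byte-for-byte to code points (ok=False)."""
    pos = 0
    while pos < len(data):
        try:
            yield data[pos:].decode(encoding), True
            return
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to data[pos:].
            if exc.start:
                yield data[pos:pos + exc.start].decode(encoding), True
            bad = data[pos + exc.start:pos + exc.end]
            yield bad.decode("latin-1"), False
            pos += exc.end


def decode_pass(data, encoding="utf-8"):
    """-pass: each non-conforming byte becomes the same-valued code point."""
    return "".join(text for text, _ in segments(data, encoding))


def decode_tag(data, encoding="utf-8"):
    """-tag: prefix each non-conforming byte, and each genuine
    replacement character, with the replacement character."""
    out = []
    for text, ok in segments(data, encoding):
        if ok:
            out.append(text.replace(REPLACEMENT, REPLACEMENT * 2))
        else:
            out.append("".join(REPLACEMENT + ch for ch in text))
    return "".join(out)


def decode_report(data, encoding="utf-8"):
    """-report: like -pass, plus a dict of start -> inclusive end indices
    (in the result) of each non-conforming substring."""
    out, errors, length = [], {}, 0
    for text, ok in segments(data, encoding):
        if not ok:
            errors[length] = length + len(text) - 1
        out.append(text)
        length += len(text)
    return "".join(out), errors


data = b"a\xffb"  # 0xFF is non-conforming in UTF-8
for profile in ("discard", "replace", "surrogate"):
    print(profile, ascii(data.decode("utf-8", BUILTIN[profile])))
print("pass", ascii(decode_pass(data)))
print("tag", ascii(decode_tag(data)))
print("report", decode_report(data)[1])
```

With the input `a\xffb`, the discard analogue yields `ab`, the surrogate analogue yields `a`, U+DCFF, `b` (which encodes back to the original bytes with the same handler), and the tag sketch yields `a`, U+FFFD, U+00FF, `b`. Python's `replace` handler may group non-conforming bytes differently from the "up to the next character boundary" rule described above, so it should be read only as an approximation of -replace.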
Rationale
-strict was introduced in TIP 346, which focused narrowly on issues surrounding byte arrays and non-mappable characters. It should instead have focused on conformance to the chosen encoding, which is more fundamental. -nocomplain was subsequently introduced in TIP 601, but that TIP did not describe its relationship to -strict. It turned out to be nothing more than the inverse of -strict, and has already been eliminated in the implementation of Tcl's internal encoding/decoding functions.
Implementation
On branch
[trunk-encodingdefaultstrict](https://core.tcl-lang.org/tcl/timeline?r=trunk-encodingdefaultstrict),
the internal TCL_ENCODING_NOCOMPLAIN has been eliminated, -nocomplain is eliminated, and -encodingstrict is available. The remaining options and behaviours described here will be implemented subsequently.
Copyright
This document has been placed in the public domain.