Author:		Ashok P. Nadkarni <[email protected]>
State:		Draft
Type:		Project
Created:	2023-06-04
Tcl-Version:	9.1
Tcl-branch:	tip-671

Abstract

This TIP proposes

a new encoding profile, lossless, that preserves roundtripping in the presence of invalid byte sequences, and
use of this new profile in various system interfaces, such as those related to file names, the environment, etc.

Note: Roundtripping is not always possible for some encodings, even in the absence of invalid bytes. This is because the mapping between Unicode and those encodings is one->many. Examples are shiftjis and big5.

Rationale

The following examples provide the motivation for this TIP.

Unix file systems treat file paths as simple byte sequences while Tcl expects them to be in the encoding returned by the encoding system command. Further, the encoding in effect at the time a file is created by an application may differ from that when the file is later accessed (different user, mounted file systems etc.) which may result in the latter interpreting the file name as containing invalid bytes for the encoding. This results in anomalous behavior as illustrated by the following sequence wherein the file exists command claims a file returned by glob does not exist. Other commands like open, file rename etc. will also show analogous behavior.

% mkdir tmp
% encoding system iso8859-1
% cd tmp
% close [open \xe9 w]
% exec ls -b
\351
% file exists [lindex [glob *] 0]
1
% encoding system utf-8
% file exists [lindex [glob *] 0]
0

Similar issues exist with environment variables. For example, two environment variables that hold different byte sequences compare equal once they have been passed down to a Tcl child process, when they should not.

apnadkarni@IO$ export X=$'\351'
apnadkarni@IO$ export Y=$'\303\251'
apnadkarni@IO$ if [ "$X" = "$Y" ]; then echo Equal; else echo Unequal; fi
Unequal
apnadkarni@IO$ ./tclsh
% string equal $env(X) $env(Y)
1

Misinterpretation of environment variables can affect searches along PATH etc.

And in command line arguments:

apnadkarni@IO$ echo 'puts [string equal {*}$argv]' > x.tcl
apnadkarni@IO$ tclsh x.tcl $'\351' $'\303\251'
1

So for example, a file name passed down to tclsh from the find program will not target the correct file (most likely failing).
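The environment and command line comparisons above can be reproduced outside Tcl. The following is a minimal sketch in Python, whose "surrogateescape" error handler implements the PEP 383 mapping that this TIP builds on. It is an analogue only, not Tcl's implementation; it assumes a UTF-8 system encoding, and decode_lossless is a hypothetical helper name.

```python
# Python analogue of the environment-variable example, assuming a UTF-8
# system encoding. "surrogateescape" is Python's PEP 383 error handler;
# decode_lossless is a hypothetical helper name, not part of Tcl.

def decode_lossless(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, mapping each invalid byte 0x80-0xFF to U+DC80-U+DCFF."""
    return raw.decode(encoding, errors="surrogateescape")

x = decode_lossless(b"\xe9")       # $'\351': not valid UTF-8 on its own
y = decode_lossless(b"\xc3\xa9")   # $'\303\251': valid UTF-8 for U+00E9

print(x == y)                      # False: the two values stay distinct
print(ascii(x), ascii(y))
```

With a passthrough mapping of this kind the stray byte keeps its own identity instead of collapsing onto U+00E9, which is the behavior the lossless profile is intended to give env and argv.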

Note the TIP is not a panacea for all the problems related to encodings in system interfaces. No general solution is possible and this TIP only addresses certain common situations.

Relation to other TIP's

This TIP is orthogonal to 657 and 667 which propose changing the default encoding profile to strict. Neither of those TIP's currently specifies the encoding profiles implicitly used by commands like glob, open etc. that interface to system API's, so the assumption is that behavior will remain as in the examples above. If those TIP's are updated to mandate the strict profile for those commands, the problems are only exacerbated. For example, the glob command in the example above will raise an error, making those directories completely unreadable.

Specification

This specification is based on the approach described in Unicode TR #36 Section 3.7 Enabling Lossless Conversion and Python's PEP 383.

The lossless profile

A new encoding profile, named lossless, is defined which can be specified anywhere that encoding profiles are accepted. When this profile is in effect for ASCII-compatible encodings (those matching ASCII in the range 0:127),

Passthrough decoding transform When converting an encoded byte stream to a Tcl string (Unicode code point sequence), invalid bytes in the range 0x80-0xFF are mapped to Unicode code points U+DC80-U+DCFF. In ASCII-compatible encodings invalid bytes can only lie in this range.
Passthrough encoding transform When converting a Tcl string to an encoded byte stream, code points in the range U+DC80-U+DCFF are mapped to byte values 0x80-0xFF. Code points not supported by the encoding are replaced with an encoding-specific fallback character as for the tcl8 and replace profiles.
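The arithmetic behind the two transforms can be sketched at the level of single values. The Python below is illustrative only: escape_byte and unescape_code_point are hypothetical names, and the sketch raises an error for values outside the passthrough ranges, whereas the profile described above substitutes a fallback character for unsupported code points. The last lines cross-check the result against Python's PEP 383 handler.

```python
# Sketch of the two passthrough transforms on single values.
# escape_byte and unescape_code_point are hypothetical names used only here.

def escape_byte(b: int) -> int:
    """Map an invalid byte 0x80-0xFF to a code point in U+DC80-U+DCFF."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only bytes 0x80-0xFF are passed through")
    return 0xDC00 + b

def unescape_code_point(cp: int) -> int:
    """Map a code point in U+DC80-U+DCFF back to the byte 0x80-0xFF."""
    if not 0xDC80 <= cp <= 0xDCFF:
        raise ValueError("only U+DC80-U+DCFF map back to bytes")
    return cp - 0xDC00

# Cross-check against Python's PEP 383 handler for a stray UTF-8 byte.
decoded = b"\xe1".decode("utf-8", errors="surrogateescape")
print(hex(escape_byte(0xE1)), hex(ord(decoded)))
print(hex(unescape_code_point(0xDCE1)))
```

Because the offset is a fixed 0xDC00, the mapping is a bijection between the 128 high byte values and the 128 code points U+DC80-U+DCFF.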

For encodings that are not ASCII-compatible,

When converting an encoded byte stream to a Tcl string, invalid bytes are mapped to the Unicode REPLACEMENT CHAR U+FFFD.
When converting a Tcl string to an encoded byte stream, code points in the range U+DC80-U+DCFF or not supported by the encoding are mapped to the encoding-dependent fallback character.

The rationale for the distinction between ASCII-compatible encodings and those that are not is detailed in the Discussion section.

Lossless roundtripping using the lossless profile is only guaranteed when the same encoding is used for input and output. Writing using a different encoding from the one used for reading will naturally not be lossless, as the invalid byte in the input encoding that was "passed through" may very well be a valid byte in the output encoding. In practice, such a mismatch is unlikely because the lossless profile is generally in effect in the system commands, which implicitly use the system encoding for both encoding and decoding.
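The same-encoding restriction can be seen in a small Python sketch that uses the PEP 383 "surrogateescape" handler as a stand-in for the lossless profile. This is an analogue under that assumption, not Tcl code; the byte values are chosen only for illustration.

```python
# Same input decoded once, then encoded with two different encodings.
# Only the encoding that matches the one used for decoding round-trips.
raw = b"\xc3\xa9\xe1"   # valid UTF-8 for U+00E9 followed by a stray 0xE1

text = raw.decode("utf-8", errors="surrogateescape")

same = text.encode("utf-8", errors="surrogateescape")
other = text.encode("latin-1", errors="surrogateescape")   # ISO 8859-1

print(same == raw)    # True: bytes are reproduced exactly
print(other == raw)   # False: U+00E9 becomes the single byte 0xE9
```

In the mismatched case the passed-through byte 0xE1 is emitted unchanged, but it is now a valid ISO 8859-1 character rather than an invalid byte, and the rest of the sequence is re-encoded differently.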

For illustrative purposes, the table below shows how the different profiles behave in their treatment of invalid bytes in an encoded UTF-8 stream.

Profile   Encoded input   Tcl string             Encoded output
                   decoding->               encoding->
tcl8      \x41\xe1\x42    U+0041,U+00E1,U+0042   \x41\xc3\xa1\x42
strict    \x41\xe1\x42    * raises error *
replace   \x41\xe1\x42    U+0041,U+FFFD,U+0042   \x41\xef\xbf\xbd\x42
lossless  \x41\xe1\x42    U+0041,U+DCE1,U+0042   \x41\xe1\x42

Only the lossless profile preserves the original byte sequence after roundtripping.
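Three of the table rows have close analogues among Python's error handlers, which allows the comparison to be checked independently. The sketch below is an approximation: "surrogateescape" stands in for the lossless profile, Python has no direct counterpart of the tcl8 profile so that row is omitted, and try_decode is a hypothetical helper.

```python
# Python analogues of the strict, replace and lossless rows of the table.
raw = b"\x41\xe1\x42"

def try_decode(errors: str):
    """Return the decoded string, or None if the handler rejects the input."""
    try:
        return raw.decode("utf-8", errors=errors)
    except UnicodeDecodeError:
        return None

results = {}
for profile, errors in [("strict", "strict"),
                        ("replace", "replace"),
                        ("lossless", "surrogateescape")]:
    text = try_decode(errors)
    # Encode back with the same handler to complete the round trip.
    back = None if text is None else text.encode("utf-8", errors=errors)
    results[profile] = (text, back)
    print(profile, ascii(text), back)
```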

Encoding strings

The encoding convertfrom and encoding convertto commands will accept lossless as a -profile option value and implement the appropriate behavior described earlier depending on direction of conversion.

Analogously, the lossless profile can be effected in public C encoding API's that accept profiles via their flags parameter by specifying the TCL_ENCODING_PROFILE_LOSSLESS flag. These functions are Tcl_ExternalToUtf, Tcl_UtfToExternal, Tcl_ExternalToUtfDStringEx, Tcl_UtfToExternalDStringEx.

Channel configuration

Likewise, channels configured with -profile lossless via fconfigure or chan configure commands will perform lossless encoding transforms on data passed through the channel.

Implicit use of lossless profiles

Affected file systems

The changes to encoding transforms implicitly used in commands that call system API's only affect platforms that use the Unix/POSIX API's. In particular,

Windows is not impacted as its system API's use wide character strings natively, not byte streams.
The zipfs file system uses its own fallback strategy and will not be changed under the presumption that the fallbacks implemented there have been tuned to common usage in the zip world.

File paths

Commands that transfer file paths to or from the system will implicitly use the system encoding with the lossless profile. These include open, cd, pwd, exec, load, info nameofexecutable as well as the file and chan ensembles where applicable.

Note in the case of exec and the pipe version of open, passed arguments are also encoded with the lossless profile.

The equivalent C API's for file access will also use lossless profiles when translating file names into native form. This includes the internal ProcessGlobalValue API's that are used to share native strings across threads (executable name, encoding and library paths, host names).
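The effect on file path commands can be sketched with a simulated directory, again using Python's PEP 383 handler in place of the lossless profile. Everything named here (FAKE_DIR, glob_names, file_exists) is hypothetical and exists only for the illustration; a UTF-8 system encoding is assumed. Python's own os.fsdecode and os.fsencode apply the same handler on non-Windows platforms.

```python
# Simulated directory whose single entry is the raw byte name 0xE9, as in
# the file exists example. FAKE_DIR and both helpers are hypothetical.
FAKE_DIR = {b"\xe9": "contents"}

def glob_names(errors: str) -> list:
    """List entries as strings decoded from UTF-8 with the given handler."""
    return [name.decode("utf-8", errors=errors) for name in FAKE_DIR]

def file_exists(name: str, errors: str) -> bool:
    """Encode the string back to bytes and look it up in the directory."""
    return name.encode("utf-8", errors=errors) in FAKE_DIR

lossy = glob_names("replace")[0]            # invalid byte becomes U+FFFD
exact = glob_names("surrogateescape")[0]    # invalid byte becomes U+DCE9

print(file_exists(lossy, "replace"))            # False: name no longer matches
print(file_exists(exact, "surrogateescape"))    # True: original bytes recovered
```

A name obtained from a listing can only be handed back to the system reliably if the decode and encode steps are exact inverses, which is the property the lossless profile supplies.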

Global variables

Values read from the environment and stored in env at program start up will be transformed using the system encoding and the lossless profile. Writes to the env will also use the same combination when storing into the native environment.

Command line arguments stored to argv at program start up will be transformed using the system encoding and lossless profile.

As for file paths, this only affects platforms that use the Unix/POSIX API's.

Internal string representation

The conversions between Tcl's different internal string representations would need to allow for isolated low surrogates. This is already the case but would need to continue to be so even if TIP 657 (which is silent on the matter) passes.
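Why isolated low surrogates need explicit allowance can be illustrated in Python, where a strict UTF-8 conversion rejects them while a generalized form accepts them. This is only an illustration of the general issue; it says nothing about how Tcl's internal representations are actually implemented.

```python
# A lone low surrogate, as produced by a passthrough decoding transform.
s = "A\udce1B"

try:
    s.encode("utf-8")                 # strict UTF-8 rejects lone surrogates
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" emits the generalized three-byte form instead of failing.
generalized = s.encode("utf-8", errors="surrogatepass")

print(strict_ok)
print(generalized)
```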

Error exceptions

There are code paths within the Tcl core where there is no mechanism for reporting or handling errors. With the exception of memory allocation failures (which result in a panic), encoding operations are always expected to succeed, which was true with the existing tcl8 profile. The current lossless profile implementation also adheres to this.

Discussion

Security issues

The ability to smuggle invalid byte encodings to and from Unicode can lead to security issues when a Tcl string that was decoded from a byte sequence using encoding A is then encoded using encoding B. The byte that was invalid in encoding A might be valid in encoding B and carry special security implications (path separator etc.). This is a programming error as roundtripping should always be done using the same encoding. Nevertheless, to mitigate this, this specification (following PEP 383) will not map byte values < 128 into the U+DC00 surrogate space. Instead they are mapped to the encoding-specific replacement character.

Since values < 128 are valid for all ASCII-compatible encodings, there is no need to map them and thus this is generally not an issue. For encodings that are not ASCII-compatible, such as EBCDIC, UTF-16 and UTF-32, roundtripping is thus not possible as they can contain invalid bytes with values < 128, which will be replaced by a fallback character.
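The restriction to values of 128 and above can be observed in Python's PEP 383 handler, which this specification follows on this point. The sketch below is an analogue with one known difference: Python raises an error for a refused code point, whereas this specification substitutes the encoding-specific replacement character. encode_escaped is a hypothetical helper.

```python
# PEP 383's handler only converts U+DC80-U+DCFF back to bytes. A code point
# such as U+DC2F, which would otherwise become "/" (0x2F), is refused.

def encode_escaped(text: str, encoding: str = "utf-8"):
    """Encode with surrogateescape; return None if a code point is refused."""
    try:
        return text.encode(encoding, errors="surrogateescape")
    except UnicodeEncodeError:
        return None

print(encode_escaped("a\udcffb"))   # high byte 0xFF is passed through
print(encode_escaped("a\udc2fb"))   # would smuggle "/", so it is refused
```

Keeping the ASCII range out of the passthrough mapping means a surrogate can never turn into a path separator, quote or other syntactically significant ASCII byte on output.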

In practice, this limitation is of little consequence because the primary use of the profile is across system interfaces and encodings not compatible with ASCII are highly discouraged in the POSIX environments this TIP targets. To quote from here

There are only 19 encodings currently used worldwide as legitimate POSIX multi-byte locale encodings UTF-8, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-13, ISO-8859-15, EUC-JP, EUC-KR, GB2312/EUC-CN, KOI8-R, KOI8-U, VISCII, WINDOWS-1251, WINDOWS-1256.

Alternative mappings

There are other code point ranges that could have been used to map invalid bytes, such as code points above U+10FFFF, private use code points, private use high surrogates etc. The primary reason for choosing U+DC00-U+DCFF was that private use code points may conflict with applications that use them for their own purposes. At the end of the day, it was felt safer to stick to the PEP 383 choice as that has been around for many years and (presumably) survived conflicts in real world use.

Tk compatibility

If file names containing "wrapped" invalid bytes are displayed in a widget, the bytes will be displayed using the replacement character glyph.

TIP 671: New encoding profile - lossless