Author: Ashok P. Nadkarni <[email protected]>
State: Draft
Type: Project
Created: 2023-06-04
Tcl-Version: 9.1
Tcl-branch: tip-671
- Abstract
- Rationale
- Relation to other TIP's
- Specification
- Discussion
- Implementation
- References
- Copyright
Abstract
This TIP proposes
- a new encoding profile, lossless, that will preserve roundtripping in the presence of invalid byte sequences, and
- use of this new profile in various system interfaces such as those related to file names, environment etc.
Note: Roundtripping is not always possible for some encodings even in the absence of invalid bytes. This is because the mapping between Unicode and those encodings is one-to-many. Examples are shiftjis and big5.
Rationale
The following examples provide the motivation for this TIP.
Unix file systems treat file paths as simple byte sequences while Tcl expects them to be in the encoding returned by the encoding system command. Further, the encoding in effect at the time a file is created by an application may differ from that when the file is later accessed (different user, mounted file systems etc.) which may result in the latter interpreting the file name as containing invalid bytes for the encoding. This results in anomalous behavior as illustrated by the following sequence wherein the file exists command claims a file returned by glob does not exist. Other commands like open, file rename etc. will also show analogous behavior.
% mkdir tmp
% encoding system iso8859-1
% cd tmp
% close [open \xe9 w]
% exec ls -b
\351
% file exists [lindex [glob *] 0]
1
% encoding system utf-8
% file exists [lindex [glob *] 0]
0
Similar issues exist with environment variables. For example, when passed down to child processes two environment variables compare equal when they should not.
apnadkarni@IO$ export X=$'\351'
apnadkarni@IO$ export Y=$'\303\251'
apnadkarni@IO$ if [ "$X" = "$Y" ]; then echo Equal; else echo Unequal; fi
Unequal
apnadkarni@IO$ ./tclsh
% string equal $env(X) $env(Y)
1
Misinterpretation of environment variables can affect searches along PATH etc.
And in command line arguments:
apnadkarni@IO$ echo 'puts [string equal {*}$argv]' > x.tcl
apnadkarni@IO$ tclsh x.tcl $'\351' $'\303\251'
1
So for example, a file name passed down to tclsh from the find program will not target the correct file (most likely failing).
Note the TIP is not a panacea for all the problems related to encodings in system interfaces. No general solution is possible and this TIP only addresses certain common situations.
Relation to other TIP's
This TIP is orthogonal to TIP 657 and TIP 667, which propose changing the default encoding profile to strict. Neither of those TIP's currently specifies the encoding profiles implicitly used by commands like glob, open etc. that interface to system API's, so the assumption is that behavior will remain as in the examples above. If those TIP's are updated to mandate the strict profile for those commands, the problems are only exacerbated. For example, the glob command in the example above will raise an error exception, making those directories completely unreadable.
Specification
This specification is based on the approach described in Unicode TR #36 Section 3.7 Enabling Lossless Conversion and Python's PEP 383.
The lossless profile
A new encoding profile, named lossless, is defined which can be specified anywhere that encoding profiles are accepted.
When this profile is in effect for ASCII-compatible encodings (those matching ASCII in the range 0:127),
- Passthrough decoding transform: When converting an encoded byte stream to a Tcl string (Unicode code point sequence), invalid bytes in the range 0x80-0xFF are mapped to Unicode code points U+DC80-U+DCFF. In ASCII-compatible encodings invalid bytes can only lie in this range.
- Passthrough encoding transform: When converting a Tcl string to an encoded byte stream, code points in the range U+DC80-U+DCFF are mapped to byte values 0x80-0xFF. Code points not supported by the encoding are replaced with an encoding-specific fallback character as for the tcl8 and replace profiles.
For encodings that are not ASCII-compatible,
- When converting an encoded byte stream to a Tcl string, invalid bytes are mapped to the Unicode REPLACEMENT CHAR U+FFFD.
- When converting a Tcl string to an encoded byte stream, code points that are in the range U+DC80-U+DCFF or are not supported by the encoding are mapped to the encoding-dependent fallback character.
The rationale for the distinction between ASCII-compatible encodings and those that are not is detailed in the Discussion section.
Lossless roundtripping using the lossless profile is only guaranteed when the same encoding is used for input and output. Writing using a different encoding from the one used for reading will naturally not be lossless as the invalid byte in the input encoding that was "passed through" may very well be a valid byte in the output encoding. In practice, this situation is not likely to arise as the lossless profile is generally in effect in the system commands, which implicitly use the system encoding for both encoding and decoding.
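The caveat about mixing encodings can be illustrated with Python's surrogateescape error handler, which implements the PEP 383 scheme this specification is based on. The sketch below is only an analogy, assuming the lossless profile behaves the same way for ASCII-compatible encodings; it is not Tcl code and says nothing about the Tcl implementation.

```python
# Analogy only: Python's "surrogateescape" handler (PEP 383) stands in for
# the proposed lossless profile.
raw = b"\xc3\xa9\xe9"   # valid UTF-8 for U+00E9, followed by a stray 0xE9 byte

# Decoding carries the stray byte as the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Same encoding in both directions: the original bytes come back unchanged.
same = text.encode("utf-8", "surrogateescape")

# Different output encoding (ISO 8859-1): 0xE9 is a valid byte there, so the
# passed-through byte can no longer be told apart from a genuine U+00E9.
other = text.encode("latin-1", "surrogateescape")

print([f"U+{ord(c):04X}" for c in text], same, other)
```

With the same encoding on both sides the byte sequence survives; with a different output encoding the two originally distinct inputs collapse into identical bytes.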
For illustrative purposes, the table below shows how the different profiles behave in their treatment of invalid bytes in an encoded UTF-8 stream.
Profile    Encoded input   Tcl string (decoded)    Encoded output (re-encoded)
tcl8       \x41\xe1\x42    U+0041,U+00E1,U+0042    \x41\xc3\xa1\x42
strict     \x41\xe1\x42    * raises error *
replace    \x41\xe1\x42    U+0041,U+FFFD,U+0042    \x41\xef\xbf\xbd\x42
lossless   \x41\xe1\x42    U+0041,U+DCE1,U+0042    \x41\xe1\x42
Only the lossless profile preserves the original byte sequence after roundtripping.
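The rows of the table can be reproduced approximately in Python, whose surrogateescape handler follows PEP 383. This is a sketch under stated assumptions rather than a description of the Tcl implementation: the name tcl8_like is a hypothetical handler written only for this example to mimic the tcl8 row, and Python's strict and replace handlers are assumed to correspond to the Tcl profiles of the same names.

```python
import codecs

def tcl8_like(err):
    # Hypothetical emulation of the tcl8 row for this one example:
    # reinterpret each offending byte as the code point of the same value.
    bad = err.object[err.start:err.end]
    return bad.decode("latin-1"), err.end

codecs.register_error("tcl8_like", tcl8_like)

raw = b"\x41\xe1\x42"   # 'A', an invalid UTF-8 byte, 'B'

def roundtrip(decode_errors):
    """Decode raw as UTF-8 with the given handler, then re-encode it."""
    text = raw.decode("utf-8", decode_errors)
    # surrogateescape on output turns U+DC80..U+DCFF back into bytes and
    # has no effect on strings that contain no lone surrogates.
    return text, text.encode("utf-8", "surrogateescape")

results = {}
for profile, handler in [("tcl8", "tcl8_like"),
                         ("replace", "replace"),
                         ("lossless", "surrogateescape")]:
    results[profile] = roundtrip(handler)

# The strict row: decoding fails instead of producing a string.
try:
    raw.decode("utf-8", "strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

for profile, (text, out) in results.items():
    print(profile, [f"U+{ord(c):04X}" for c in text], out)
print("strict raised:", strict_raised)
```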
Encoding strings
The encoding convertfrom and encoding convertto commands will accept lossless as a -profile option value and implement the appropriate behavior described earlier depending on direction of conversion.
Analogously, the lossless profile can be effected in public C encoding API's that accept profiles via their flags parameter by specifying the TCL_ENCODING_PROFILE_LOSSLESS flag. These functions are Tcl_ExternalToUtf, Tcl_UtfToExternal, Tcl_ExternalToUtfDStringEx, Tcl_UtfToExternalDStringEx.
Channel configuration
Likewise, channels configured with -profile lossless via fconfigure or chan configure commands will perform lossless encoding transforms on data passed through the channel.
Implicit use of lossless profiles
Affected file systems
The changes to encoding transforms implicitly used in commands that call system API's only affect platforms that use the Unix/POSIX API's. In particular,
- Windows is not impacted as its system API's use wide character strings natively and not byte streams.
- The zipfs file system uses its own fallback strategy and will not be changed, under the presumption that the fallbacks implemented there have been tuned to common usage in the zip world.
File paths
Commands that transfer file paths to or from the system will implicitly use the system encoding with the lossless profile. These include open, cd, pwd, exec, load, info nameofexecutable as well as the file and chan ensembles where applicable.
Note in the case of exec and the pipe version of open, passed arguments are also encoded with the lossless profile.
The equivalent C API's for file access will also use lossless profiles when translating file names into native form. This includes the internal ProcessGlobalValue API's that are used to share native strings across threads (executable name, encoding and library paths, host names).
Global variables
Values read from the environment and stored in env at program start up will be transformed using the system encoding and the lossless profile. Writes to env will also use the same combination when storing into the native environment.
Command line arguments stored to argv at program start up will be transformed using the system encoding and lossless profile.
As for file paths, this only affects platforms that use the Unix/POSIX API's.
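The effect on the environment variable example from the Rationale can be sketched with Python's surrogateescape handler (PEP 383), assuming a UTF-8 system encoding. This is an analogy for the intended behavior of the lossless profile, not Tcl code; the variable names are illustrative only.

```python
# Analogy only: the raw bytes as they would arrive from the environment.
x = b"\xe9"        # export X=$'\351'      (not valid UTF-8)
y = b"\xc3\xa9"    # export Y=$'\303\251'  (UTF-8 for U+00E9)

x_str = x.decode("utf-8", "surrogateescape")   # lone surrogate U+DCE9
y_str = y.decode("utf-8", "surrogateescape")   # U+00E9

# The two values no longer compare equal after decoding...
distinct = x_str != y_str

# ...and each converts back to exactly the bytes it came from.
restored = (x_str.encode("utf-8", "surrogateescape"),
            y_str.encode("utf-8", "surrogateescape"))

print(distinct, restored)
```

The same reasoning applies to command line arguments, since they take the same decoding path.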
Internal string representation
Conversions between Tcl's different internal string representations need to allow for isolated low surrogates. This is already the case but would need to remain so even if TIP 657 (which is silent on the matter) passes.
Error exceptions
There are code paths within the Tcl core where there is no mechanism for reporting or handling errors. With the exception of memory allocation failures (which result in a panic), encoding operations are always expected to succeed, which was true with the existing tcl8 profile. The current lossless profile implementation also adheres to this.
Discussion
Security issues
The ability to smuggle invalid byte encodings to and from Unicode can lead to security issues when a Tcl string that was decoded from a byte sequence using encoding A is then encoded using encoding B. The byte that was invalid in encoding A might be valid in encoding B and carry special security implications (path separator etc.). This is a programming error as roundtripping should always be done using the same encoding. Nevertheless, to mitigate this, this specification (following PEP 383) will not map byte values < 128 into the U+DC00 surrogate space. Instead they are mapped to the encoding-specific replacement character.
Since values < 128 are valid for all ASCII-compatible encodings, there is no need to map them and thus this is generally not an issue. For encodings that are not ASCII-compatible, such as EBCDIC, UTF-16 and UTF-32, roundtripping is thus not possible as they can contain invalid bytes with values < 128, which will be replaced by a fallback character.
In practice, this limitation is of little consequence because the primary use of the profile is across system interfaces, and encodings not compatible with ASCII are highly discouraged in the POSIX environments this TIP targets. To quote:
There are only 19 encodings currently used worldwide as legitimate POSIX multi-byte locale encodings UTF-8, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-13, ISO-8859-15, EUC-JP, EUC-KR, GB2312/EUC-CN, KOI8-R, KOI8-U, VISCII, WINDOWS-1251, WINDOWS-1256.
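The restriction on byte values below 128 described above can be observed in Python's surrogateescape handler, which follows PEP 383. Note one difference from this specification: Python raises an error for a code point in U+DC00-U+DC7F, whereas this TIP substitutes a replacement character. The sketch is an analogy only and does not describe the Tcl implementation.

```python
# Analogy only: Python's PEP 383 handler, not the Tcl implementation.

# Bytes >= 0x80 are escaped into U+DC80..U+DCFF and restored on output.
high = b"\x80".decode("ascii", "surrogateescape")
high_back = high.encode("ascii", "surrogateescape")

# A code point in U+DC00..U+DC7F is never turned back into an ASCII byte,
# so a string cannot smuggle in, say, 0x2F ('/') through this mechanism.
try:
    "\udc2f".encode("utf-8", "surrogateescape")
    low_rejected = False
except UnicodeEncodeError:
    low_rejected = True

print(f"U+{ord(high):04X}", high_back, low_rejected)
```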
Alternative mappings
There are other code point ranges that could have been used to map invalid bytes, such as code points above U+10FFFF, private use code points, private use high surrogates etc. The primary reason for choosing U+DC80-U+DCFF was that private use code points may conflict with applications that use private use code points for their own purposes. At the end of the day, it was felt safer to stick to the PEP 383 choice as that has been around for many years and (presumably) survived conflicts in real world use.
Tk compatibility
If file names containing "wrapped" invalid bytes are displayed in a widget, the bytes will be displayed using the replacement character glyph.
Implementation
Implementation is in progress in the tip-671 branch.
References
Copyright
This document has been placed in the public domain.