Author: Harald Oehlmann <[email protected]>
Author: Jan Nijtmans <[email protected]>
State: Withdrawn
Type: Project
Vote: Done
Tcl-Version: 9.0
Tcl-Branch: encodings-with-flags
Vote-Summary: Accepted 6/0/0
Votes-For: FV, JD, JN, KW, MC, SL
Votes-Against: none
Votes-Present: none
Abstract
This TIP proposes to change the behavior of encoding convertto/convertfrom
in Tcl 9.0, to start throwing an exception on any conversion error.
An additional option -nocomplain
is proposed to restore the Tcl 8.x behavior.
This TIP is withdrawn in favor of TIPs #656 and #657.
Rationale
The commands encoding convertfrom/convertto
currently do not raise any error on non-convertible input.
Instead, the following actions are observed:
- A replacement character ? is used.
- An incomplete multi-byte sequence is added verbatim.
Example 1: not encodable character
The Polish character "L with bar" is not contained in ISO-latin 1:
% set s \u0141
Ł
% encoding convertto iso8859-1 $s
?
In the ISO-Latin 1 conversion, the encoding convertto command replaces it with a question mark.
Example 2: Incomplete sequence returns remaining value verbatim
The following utf-8 sequence has an incomplete final sequence. The second byte of the two byte sequence of the last character is missing. The incomplete sequence is interpreted as ISO8859-1 and added to the string.
% set s [encoding convertfrom utf-8 [string range [encoding convertto utf-8 ÄÖ] 0 end-1]]
ÄÃ
% scan $s %c%c
196 195
The first character value 196 is the correct "Ä" character. The second character is the verbatim byte of the incomplete utf-8 sequence:
% scan [encoding convertto utf-8 Ö] %c%c
195 150
Use case 1: decode continuous multi-byte data
My personal use case is a stream of UTF-8 data which is received by a USB character driver; the binary data is cut into 64-byte chunks. The stream is continuous and I want to decode the received data. If a UTF-8 multi-byte sequence is only partly received, a wrong character is created, and decoding of the next chunk does not work either, as it starts in the middle of a UTF-8 multi-byte sequence.
It would be great to know where the error is, in order to stop the conversion there.
Here is a code snippet with the current implementation:
% catch {encoding convertfrom utf-8 [string range [encoding convertto utf-8 ÄÖ] 0 end-1]} e d
1
% set d
-code 1\
-level 0
-errorstack {INNER {invokeStk1 ::tcl::encoding::convertfrom utf-8 Ã\x84Ã}}
-errorcode {TCL ENCODING ILLEGALSEQUENCE 2}
-errorinfo {unexpected byte sequence starting at index 2: '\xC3'
while executing
"encoding convertfrom utf-8 [string range [encoding convertto utf-8 ÄÖ] 0 end-1]"} -errorline 1
Note: "\x84" was substituted for the control character "IND" for visibility in the stack trace
Note: "C3" is the hexadecimal representation for decimal 195. Thus, "\xC3" is the first byte of the utf-8 representation of "Ö".
With this info, the -errorcode may be caught by a try clause and the error byte location (2) may be extracted (see the discussion section for an example).
The data before that position may be passed again to encoding convertfrom; it is the correctly received data.
Note: it would be more efficient if the already converted string were returned as well.
Then the data would not need to be passed to encoding convertfrom again.
The current implementation does not provide this optimisation.
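The chunked-stream scenario above could be handled roughly as follows. This is only a sketch under the behavior proposed in this TIP (an exception carrying the error position as the fourth error code element); the helper name decodeChunk and the variable pending are made up for illustration. It treats a failure within the last three bytes as a truncated sequence to be completed by the next chunk, and re-raises anything else.

```tcl
# Hypothetical helper: decode one chunk of a continuous UTF-8 byte stream.
# pendingVar names a caller variable holding bytes left over from earlier
# chunks; it must be initialised to the empty string before the first call.
proc decodeChunk {pendingVar chunk} {
    upvar 1 $pendingVar pending
    set data $pending$chunk
    try {
        set text [encoding convertfrom utf-8 $data]
        set pending {}
    } trap {TCL ENCODING ILLEGALSEQUENCE} {msg opts} {
        # The fourth element of the error code is the error position.
        set pos [lindex [dict get $opts -errorcode] 3]
        if {[string length $data] - $pos >= 4} {
            # A UTF-8 sequence has at most 4 bytes, so this is not merely
            # a truncated tail: pass the original error on to the caller.
            return -options $opts $msg
        }
        # Decode the valid prefix and keep the incomplete tail for later.
        set text [encoding convertfrom utf-8 \
            [string range $data 0 [expr {$pos - 1}]]]
        set pending [string range $data $pos end]
    }
    return $text
}

# Usage: feed the chunks in arrival order.
set pending {}
set bytes [encoding convertto utf-8 "\u00C4\u00D6"]
append out [decodeChunk pending [string range $bytes 0 2]]
append out [decodeChunk pending [string range $bytes 3 end]]
```

Note that the valid prefix is converted twice here, which is exactly the inefficiency mentioned in the note above.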
Use case 2:
This use case is given in Tcl ticket 535705:
Wrong characters end up in a database through character replacement when a character is not in the current system encoding. This causes issues in multi-platform applications, as the error is not detected.
Specification
New Option
The encoding
ensemble will be extended by a new option -nocomplain
:
encoding convertfrom ?-nocomplain? ?encoding? data
encoding convertto ?-nocomplain? ?encoding? data
In Tcl 8.x, -nocomplain has no effect since those encoding subcommands
currently never throw an exception: any invalid byte/character is replaced
(by ? or U+FFFD). In Tcl 9.0, those subcommands start throwing an exception
on any conversion data error; -nocomplain restores the Tcl 8.x behavior.
Definition of "error position"
The position of the error in the source string is indicated in the error reporting. In case of multi-byte source data, this position is always one byte after the last correct multi-byte sequence.
Error Message
The error message is: "unexpected byte sequence starting at index error position: 'byte value'", for encoding convertfrom
or "unexpected character at index error position: 'character value'", for encoding convertto
where error position is a number containing the source string error position as defined above.
byte value/character value is the hexadecimal representation of the byte or character in the source string that error position points to.
Error Code
The error code is composed of the following 4 list elements:
- Fixed word: TCL
- Fixed word: ENCODING
- Fixed word: ILLEGALSEQUENCE
- Value error position: The index in the source string (usually a byte array, in case of encoding convertfrom) of the error position.
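As a small illustration of this layout, the four elements can be unpacked with lassign. The literal value below is the error code shown in use case 1; the variable names are chosen for this sketch only.

```tcl
# Error code as reported in use case 1 (literal copied from there).
set errorCode {TCL ENCODING ILLEGALSEQUENCE 2}

# Unpack the three fixed words and the error position.
lassign $errorCode facility subsystem kind position

# Only interpret the fourth element if the three fixed words match.
if {[lrange $errorCode 0 2] eq {TCL ENCODING ILLEGALSEQUENCE}} {
    puts "conversion failed at source index $position"
}
```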
New C API
Introduce 2 new functions:
Tcl_Size Tcl_ExternalToUtfDStringEx(Tcl_Encoding encoding, const char *src, int srcLen, int flags, Tcl_DString *dsPtr)
Tcl_Size Tcl_UtfToExternalDStringEx(Tcl_Encoding encoding, const char *src, int srcLen, int flags, Tcl_DString *dsPtr)
These functions behave the same as Tcl_ExternalToUtfDString/Tcl_UtfToExternalDString, only they have
an additional flags parameter accepting the following additional values (can be used in combination):
- TCL_ENCODING_STOPONERROR: don't replace invalid characters/bytes but return the first error position. Default in Tcl 9.0.
- TCL_ENCODING_NOCOMPLAIN: replace invalid characters/bytes by a default fallback character. Always return TCL_INDEX_NONE. Default in Tcl 8.7.
- TCL_ENCODING_MODIFIED: convert NUL bytes to \xC0\x80 instead of 0x00. Only meaningful for "utf-8" and "cesu-8", ignored for other encodings. This flag may be used together with the other flags.
The TCL_ENCODING_MODIFIED flag can be used in extensions for generating "modified" encodings, such as Java (which uses "modified" cesu-8 internally). This flag is not exposed at script level, unlike -nocomplain.
The (already existing) TCL_ENCODING_STOPONERROR flag is only provided for legacy reasons. This flag will be meaningless starting with Tcl 9.0, therefore will be deprecated in Tcl 9.0 and eventually removed in a future Tcl version (but not yet in 9.0). In Tcl 9.0, TCL_ENCODING_STOPONERROR will be defined as value 0.
The return value of these two functions is the error position if an error is reported, or TCL_INDEX_NONE if everything is OK.
Discussion
Ticket 535705
This TIP started in the TCL ticket 535705. Please refer to this ticket for information about the initial discussion.
Error reporting design
The list of categories for the error code return is given in the tclvars manual page. The TCL category matches best.
This design allows catching this error and getting the error position with the following try pattern:
try {
    set res [encoding convertto iso8859-1 $input]
} trap {TCL ENCODING ILLEGALSEQUENCE} {errorMessage errorDict} {
    set errorIndex [lindex [dict get $errorDict -errorcode] 3]
    ...
}
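The pattern can be wrapped in a small helper. The following is only a sketch, assuming the exception behavior proposed in this TIP; the procedure name firstUnencodable is hypothetical. It returns the index of the first character that the target encoding cannot represent, or -1 if the whole string converts.

```tcl
# Hypothetical helper built on the try pattern above.
proc firstUnencodable {encodingName input} {
    try {
        encoding convertto $encodingName $input
    } trap {TCL ENCODING ILLEGALSEQUENCE} {errorMessage errorDict} {
        # Fourth error code element: index of the offending character.
        return [lindex [dict get $errorDict -errorcode] 3]
    }
    return -1
}

# Usage: locate the character from Example 1 inside a longer string.
set idx [firstUnencodable iso8859-1 "ab\u0141"]
```

On an interpreter that still replaces characters silently (the Tcl 8.x behavior), the helper always returns -1.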
Rejected alternatives
Report the error character
The original implementation reported the failing character/byte in the error message. This may be a control character corrupting a terminal view. IMHO error messages should be in the printable ASCII character set. Therefore this was removed, the character/byte is now only reported in hexadecimal notation.
-nocomplain as boolean option
An alternative would be to use -nocomplain 1 instead of -nocomplain, and -nocomplain 0 instead of omitting the option.
This only makes the command longer, without much benefit.
-stoponerror
This option is the reverse of -nocomplain, but much less descriptive. In combination with -nocomplain,
it causes more confusion than it helps.
EILSEQ POSIX error code
Recent changes to Tcl use the POSIX error EILSEQ: "invalid byte sequence", which looks like the most appropriate error message. Nevertheless, the POSIX message does not allow returning the error position.
Alternate solutions
See [607], which is not actually an alternative, but was added later on top of this TIP.
Implementation
The implementation is in the Tcl branch encodings-with-flags.
The original implementation used size_t in the function signature, but after [628] this changed to Tcl_Size.
Compatibility
The implementation is fully backward compatible for 8.7. There is a compatibility break for Tcl 9.0.
Credits
Thanks to Jan Nijtmans for the idea and implementation.
Copyright
This document has been placed in the public domain.