Tcl Source Code

Ticket UUID: 535705
Title: Tcl interface to stop on encoding errors
Type: RFE
Version: None
Submitter: nobody
Created on: 2002-03-27 13:19:11
Subsystem: 44. UTF-8 Strings
Assigned To: jan.nijtmans
Priority: 5 Medium
Severity: Minor
Status: Closed
Last Modified: 2023-02-06 13:03:50
Resolution: Fixed
Closed By: jan.nijtmans
Closed on: 2023-02-06 13:03:50
Description:
When [encoding convertto] encounters a character that 
cannot be represented in the target encoding, it 
silently substitutes a default instead. In the C 
interface there is a flag TCL_ENCODING_STOPONERROR which 
instead causes the underlying function to return with an 
error when this condition occurs. It would be useful to 
have a switch for [encoding convertto] to get the same 
behaviour there, especially for cross-platform 
installation programs (where encoding conversions are 
common and unnoticed character substitutions can be very 
dangerous).
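
Until such a switch is available, a script can approximate the check with a round trip through the target encoding. The following is only a minimal sketch: the helper name convertsCleanly is hypothetical, and it assumes that a conversion either substitutes a replacement character silently or raises an error, depending on the Tcl version in use.

```tcl
# Hypothetical helper (not part of Tcl): report whether $str can be
# converted to $enc and back without being altered.
proc convertsCleanly {enc str} {
    # Some Tcl versions raise an error on unrepresentable characters;
    # treat that the same as a lossy conversion.
    if {[catch {encoding convertto $enc $str} bytes]} {
        return 0
    }
    if {[catch {encoding convertfrom $enc $bytes} back]} {
        return 0
    }
    # A silent substitution shows up as a round-trip mismatch.
    return [expr {$back eq $str}]
}

# The euro sign (U+20AC) has no representation in ISO 8859-1.
puts [convertsCleanly iso8859-1 "price: \u20AC"]
puts [convertsCleanly utf-8 "price: \u20AC"]
```

The round trip costs a second conversion, which is one reason a native switch would still be preferable.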

PS: While I'm at [encoding] anyway, I might as well 
point out that the man page for the [encoding] command 
(of Tcl8.4 as currently on tcl.activestate.com) in the 
Example section incorrectly claims that "because the 
source command always reads files using the ISO8859-1 
encoding, Tcl will treat each byte in the file as a 
separate character that maps to the 00 page in Unicode". 
I believe the truth is rather that [source] uses the 
system encoding. The example should work, since \xHH 
produces characters in said Unicode page, but including 
the corresponding bytes as explicit characters in the 
source probably won't work cross-platform.
User Comments:

jan.nijtmans added on 2023-02-06 13:03:50:

Due to TIP #601, this is now fully implemented


oehhar added on 2021-05-07 10:28:09:

Great proposal ! Shall I update the TIP ?


jan.nijtmans added on 2021-05-07 10:06:38:

> I am not in favor of including the hex value of the input byte that caused the conversion error. The reasons are: ...

Well, how about a better error-message then? The first byte of invalid UTF-8 tells something useful: If it is between 0x80-0xBF or 0xC1 or between 0xF8-0xFF then that's the error-byte. If it is 0xC0 or between 0xC2-0xF7 then it's a follow-up byte problem.
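
The first-byte rule above can be written down as a small classifier. This is only a transcription of the ranges quoted in this comment, not the code Tcl uses internally; the proc name classifyLeadByte and the returned labels are hypothetical.

```tcl
# Hypothetical helper: classify the first byte of an invalid UTF-8
# sequence according to the ranges quoted in the comment above.
proc classifyLeadByte {byte} {
    if {($byte >= 0x80 && $byte <= 0xBF) || $byte == 0xC1
            || ($byte >= 0xF8 && $byte <= 0xFF)} {
        # The byte itself can never start a sequence.
        return error-byte
    }
    if {$byte == 0xC0 || ($byte >= 0xC2 && $byte <= 0xF7)} {
        # A plausible lead byte, so the fault lies in what follows it.
        return follow-up
    }
    # Bytes below 0x80 are plain ASCII and not covered by the rule.
    return ascii
}

puts [classifyLeadByte 0x9C]
puts [classifyLeadByte 0xC3]
```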

> When converting to a single byte encoding (example ISO-8859-1), the error byte value will be the first byte of an utf-8 sequence. IMHO this causes more issues than it helps.

The error-message contains the character value in this case, not a byte value. It shows the character code, which must be > 0xFF for a character that cannot be represented in ISO-8859-1.

I changed the error-message now to "unexpected byte sequence starting at index ..."


jan.nijtmans added on 2021-05-06 15:35:39:

Improved error-message for Tcl 9 now:

    $ tclsh
    % puts \uFFFF
    error writing "stdout": illegal byte sequence


oehhar added on 2021-05-06 15:15:14:

The ticket mentions issues with the documentation of the encoding command. Starting with commit [aed33ad43a], the example was modified to only have a small conversion example.

The whole part about the source command is outdated and belongs in the source command documentation anyway. Consequently, it has been deleted from the encoding command documentation.


oehhar added on 2021-05-06 15:11:22:

Jan, thank you for the edits in the TIP. I merged them in the TIP happily.

May I raise one minor point?

I am not in favor of including the hex value of the input byte that caused the conversion error. The reasons are:

  • In a multi-byte sequence like an invalid utf-8 sequence, the error is not just one byte, but a series of bytes. Unfortunately, only the first byte is shown.
  • When converting to a single byte encoding (example ISO-8859-1), the error byte value will be the first byte of an utf-8 sequence. IMHO this causes more issues than it helps.
  • The error will not happen on ASCII characters (they always convert to everything). The error will happen on special characters where only a very limited number of persons know (and understand) the Byte Codes.

Any opinion welcome.

Thank you, Harald


jan.nijtmans added on 2021-05-06 14:19:20:

> Is the syntax variant handled, where the encoding is not given, but -stoponerror

No, that's not correct yet. Still to be fixed!


oehhar added on 2021-05-06 12:31:20:

Jan,

thank you for the action.

Is the syntax variant handled, where the encoding is not given, but -stoponerror:

encoding convertto -stoponerror ÄÖÜ

Thanks, Harald


jan.nijtmans added on 2021-05-06 11:53:18:

Implementation modified now, making "-stoponerror/-nothrow" the first argument instead of the last. That's a good idea.

Instead of "-stoponerror 1" I would prefer just "-stoponerror", and instead of "-stoponerror 0" I would prefer simply "-nothrow". Using boolean flags makes sense when the command has many parameters, but this one only has a few. It also makes clearer what the two behaviours are: either stop when an error occurs (throwing an exception), or don't throw an exception (replacing the invalid byte/char with \uFFFD, as Tcl 8.x does). The default is "-nothrow" in Tcl 8.x and "-stoponerror" in Tcl 9.0. This way it is very well possible to change the default in 9.0.

Thanks for all you are doing!


oehhar added on 2021-05-05 16:43:09:

Please consider TCL TIP601 which proposes a related change:

https://core.tcl-lang.org/tips/doc/trunk/tip/601.md

Harald


jan.nijtmans added on 2021-04-01 13:57:33:

Experimental implementation added: -stoponerror is now the default if Tcl is compiled with -DTCL_NO_DEPRECATED.

Demo (macOS with utf-8 as system-encoding):

    $ tclsh
    % puts \uFFFF
    error writing "stdout": invalid argument

So it works, but the error-message can still be improved a lot.


jan.nijtmans added on 2021-04-01 10:47:41:

I don't think the implementation is there yet. What if the system encoding is "utf-8" and we are doing simply:

    puts stdout \uD800

Then we are producing non-conformant UTF-8. I would expect the same exception to be thrown then (in Tcl 9.0; we cannot do that in Tcl 8.x).


jan.nijtmans added on 2021-04-01 10:44:03:

> I think, this change is valueable and the default behaviour is IMHO more a bug than a feature.

Yeah. Maybe the -stoponerror should just be the default in Tcl 9.0 ....

> I propose to take the challenge ...

I'll be a happy sponsor! No hurry, better well-done.


oehhar added on 2021-04-01 08:38:38:

Jan, about the "challenge": I think this change is valuable, and the default behaviour is IMHO more a bug than a feature.

I propose to take the challenge including the steps:

  • writing the TIP
  • implementation
  • tests
  • documentation

This may take a while (3 months).

Thank you for the work and support, Harald


oehhar added on 2021-03-31 19:28:53:

Dear Jan,

first of all, thank you for the implementation. That is great.

Here is an example with two two-byte utf-8 sequences, where the last byte is not transmitted:

% catch {encoding convertfrom utf-8 [string range [encoding convertto utf-8 ÄÖ] 0 end-1] -stoponerror} e d
1
% set d
-code 1
-level 0
-errorstack {INNER {invokeStk1 ::tcl::encoding::convertfrom utf-8 Äà -stoponerror}}
-errorcode NONE
-errorinfo {unexpected byte at index 2: 'Ã' (\xC3)
    while executing
"encoding convertfrom utf-8 [string range [encoding convertto utf-8 ÄÖ] 0 end-1] -stoponerror"}
-errorline 1

This is great.

I would prefer to return the position index (2) as part of the error code to make it machine parsable without parsing an error message. Example error code proposal: [list ENCODING STOPONERROR $errorPosition]

So, the following code would be possible without string parsing on the error message:

proc receive {data} {
    # Possible remainder left over from the previous call
    global unprocessedData
    # Append the current data chunk to that remainder
    append unprocessedData $data
    try {
        set dataUTF8 [encoding convertfrom utf-8 $unprocessedData -stoponerror]
        set unprocessedData ""
    } trap {ENCODING STOPONERROR} {errorMessage errorDict} {
        # Partial receive: decode up to the error position, keep the rest
        set errorPosition [lindex [dict get $errorDict -errorcode] 2]
        set dataUTF8 [encoding convertfrom utf-8 [string range $unprocessedData 0 $errorPosition-1]]
        set unprocessedData [string range $unprocessedData $errorPosition end]
    } on error {errorMessage errorDict} {
        # handle other errors
    }
    # process dataUTF8
}

The idea is a routine that receives binary data forming a utf-8 stream, where the chunk boundaries may not coincide with character boundaries. The receive procedure is called for each received data packet and assembles a valid utf-8 string without losing or changing data (both are possible without the -stoponerror flag).
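
For comparison, the same reassembly can be sketched without any error reporting from [encoding], by inspecting the last few bytes of the buffer for an unfinished sequence. This is a different technique from the one proposed above, shown only to illustrate the use case; the names utf8CompleteLength, receiveChunk and pendingBytes are hypothetical, and the sketch does not validate the data beyond spotting a truncated tail.

```tcl
# Hypothetical helper: number of leading bytes of $bytes that do not
# end in the middle of a UTF-8 sequence.
proc utf8CompleteLength {bytes} {
    set len [string length $bytes]
    # A sequence is at most 4 bytes, so an unfinished one has its lead
    # byte among the last 3 bytes.
    for {set back 1} {$back <= 3 && $back <= $len} {incr back} {
        binary scan [string index $bytes end-[expr {$back - 1}]] cu b
        if {$b < 0x80} {
            # ASCII byte: nothing is pending.
            return $len
        }
        if {$b >= 0xC0} {
            # Lead byte: derive the sequence length it announces.
            if {$b >= 0xF0} {
                set need 4
            } elseif {$b >= 0xE0} {
                set need 3
            } else {
                set need 2
            }
            if {$back < $need} {
                # The sequence is cut off: exclude it.
                return [expr {$len - $back}]
            }
            return $len
        }
        # Continuation byte (0x80-0xBF): keep walking backwards.
    }
    return $len
}

# Hypothetical receiver: decode what is complete, keep the rest.
proc receiveChunk {data} {
    global pendingBytes
    append pendingBytes $data
    set n [utf8CompleteLength $pendingBytes]
    set text [encoding convertfrom utf-8 [string range $pendingBytes 0 [expr {$n - 1}]]]
    set pendingBytes [string range $pendingBytes $n end]
    return $text
}
```

Feeding it the bytes of "12Ü" split inside the last character would return "12" from the first call and "Ü" from the second. Malformed input elsewhere in the stream is still left to [encoding convertfrom], which is where an error index in the error code would help.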

Thank you again, I appreciate, Harald

Note 1: Other interfaces are possible, especially one where the substring already decoded up to the error position is made available somehow, so that a second call to "encoding convertfrom" is not necessary. But that is a detail.

Note 2: I did not check for a possible POSIX error code which might be more appropriate. The error code list is just an example.

Note 3: I don't see the need to give the error byte value in the error message. For me, the error position in the source string in the error code would just do the job. IMHO this is for automatic processing, not for readable text.

Note 4: As readable error message, I would prefer to mention the used encoding to have something like: "data not in 'utf-8' format".

Note 5: For the record: Test suite results with MS-VC2015 32 bit on MS-Win10 64bit GER, compared to tcl8.7a3:

Still failing tests: socket-14.11.0


oehhar added on 2021-03-30 15:20:38:

Thank you, Jan, great work. I plan to look to it tomorrow.

Thanks, Harald


jan.nijtmans added on 2021-03-30 15:08:38:

@harald. The "encodings-with-flags" has better error-messages now, it doesn't give only the position, but also the byte or character where the error occurred and the Unicode value of it.

Can you try it, and modify it to return the error the way you want? Since you have a use-case, you are the best person to review/revise what the API should look like. I'll leave it to you (if you accept that challenge....)

The "encodings-with-flags" branch is now derived from the tip-597 branch. This branch has more encodings with more error-situations, this makes the STOPONERROR flag better testable. I think TIP #597 is ready to put up for voting.

Thanks!


oehhar added on 2021-03-26 17:28:41:

Great, thank you !

May there be a way to return the error position (or ok count) in a more portable way like in the error code list ? -> errorCode: "ERR_ENCODING POS 3"

As an alternate solution, we may return the ok data and find a way to inform about : error yes/no and error position. IMHO this would be the most fruitful return set possible.

Thank you, Harald


jan.nijtmans added on 2021-03-26 17:03:28:

First concept now in "encodings-with-flags" branch.

Demo:

% encoding convertfrom utf-8 12Ã -stoponerror
encoding error after producing 2 characters


oehhar added on 2021-03-26 08:47:33:

Dear Jan,

thank you for approaching this, that is great !

A common use-case (for me) is to receive utf-8 over a byte channel and to detect the character boundaries.

For example, the sender sends: "12Ü" which is the utf-8 byte stream: 49 50 195 156

Let's receive it byte by byte:

% set s [encoding convertto utf-8 12Ü]
12Ãœ
% encoding convertfrom utf-8 [string index $s 0]
1
% encoding convertfrom utf-8 [string index $s 1]
2
% encoding convertfrom utf-8 [string index $s 2]
Ã
% encoding convertfrom utf-8 [string index $s 3]
œ

If the sequence is incomplete (and thus invalid utf-8), each undecoded byte is returned as a character of its own instead of being reported as an error.
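
The byte values behind such a string can be listed with [binary scan], which makes it easier to see where a character boundary falls. A minimal sketch, writing Ü as \u00DC so that it does not depend on the encoding of the script file:

```tcl
# Encode "12Ü" as UTF-8 and list the resulting bytes as unsigned integers.
set bytes [encoding convertto utf-8 "12\u00DC"]
binary scan $bytes cu* values
puts $values
# prints: 49 50 195 156
```

The last two values, 195 and 156 (0xC3 0x9C), together form the single character Ü, so a chunk ending after 195 is incomplete.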

I would appreciate having the following properties:

  • no replacement on incomplete characters
  • the character index of the error is returned, so the issue may be examined.

Background: https://wiki.tcl-lang.org/page/Unicode+and+UTF%2D8

Thank you, Harald