Tcl Source Code

View Ticket
Login
Ticket UUID: 6978c01b652ac87cfc7913f5e534825f15824768
Title: Channel encoding difference 8.6 <-> 9.0
Type: Bug Version:
Submitter: oehhar Created on: 2022-08-31 10:11:28
Subsystem: - New Builtin Commands Assigned To: jan.nijtmans
Priority: 5 Medium Severity: Severe
Status: Closed Last Modified: 2023-03-24 22:00:27
Resolution: Fixed Closed By: jan.nijtmans
    Closed on: 2023-03-24 22:00:27
Description:

While testing, if TIP 601 did anything on the channel system, I have made two tests with TCL 8.6.12 and TCL 9.0 on Windows, compiled with MS-VC6 and VS2015 for 32 bit.

When sending a character not contained in the current encoding, the replacement character "?" is sent. On 9.0, for my perception, nothing is sent and sending stops.

I started two tclsh shells with the following pasted into the console:

set serverhandle [socket -server connect 30000]
proc connect {handle args} {
	if {[catch {
		fconfigure $handle -encoding iso8859-1
		puts $handle A\u2022B"
		close $handle
	} res]} {
		puts stderr "Write error: $res"
	}
}
vwait forever

and the client:

set clienthandle [socket localhost 30000]
fconfigure $clienthandle -encoding iso8859-1 -eofchar ""
puts "Read data: '[get $clienthandle]'"
close $clienthandle

TCL 8.6.12 returns on the client side:

Read data: 'A?B"'

and on the TCL 9.0 client side:

Read data: 'A'

I think, on 9.0, the replacement "?" for the not representable character \u2022 (a centered dot) is not sent.

I am not sure, what the correct result would be. I am a bit lost in the discussion.

Take care, Harald

User Comments: jan.nijtmans added on 2023-03-24 22:00:27:

Actually, this ticket is a duplicate of [1bedc53c8c]


jan.nijtmans added on 2023-03-24 21:59:15:

Fixed here.

Actually, this ticket is a duplicate of [18437bdc82d0152c]


oehhar added on 2022-09-15 06:49:34:

Hi Jan, thanks for the work. IMHO, the read issue is similar to the write issue. The error is not thrown., as an eventual error result from *Tcl_ExternalToUtf* is not handled.

I have noticed that when implementing TCL9 version on TIP633: https://core.tcl-lang.org/tcl/info/995f1b858d0483cb.

This routine is called 3 times in tclio.c, while IMHO only one is relevant.

I may work on this the next two days.

Take care, Harald


jan.nijtmans added on 2022-09-13 16:04:54:

Testcase io-75.6 and io-75.9 are now marked as "knownBug". When Tcl is compiled with --enable-symbols, they throw an exception:

Test file error: Assertion failed: (!GotFlag(statePtr, CHANNEL_EOF) || GotFlag(statePtr, CHANNEL_STICKY_EOF) || Tcl_InputBuffered((Tcl_Channel)chanPtr) == 0), function DoReadChars, file tclIO.c, line 6009.

Since those 2 testcases give the wrong answer (stripping the output), what more proof do we want there is a bug here?


jan.nijtmans added on 2022-09-13 12:39:10:

Better re-open: It looks like a similar bug-fix as done in Write() is needed in Read() as well... Definitely not fixed! Thanks for the testcases showing this!


jan.nijtmans added on 2022-09-13 09:07:32:

Hi Harald.

Why does the line set d [read $f] in testcase io-75.9 (Tcl 9.0) not throw an exception or output the bytes verbatim?

Also, testcase io-75.10 shows a situation where a -strictencoding could be useful. The current behavior is just output the invalid byte, but it's something to think about ....


jan.nijtmans added on 2022-09-12 13:09:35:

> On TCL 9.0, the output is truncated at the first error position.

That's what I expect. Good!

I wouldn't be surprised if the TCL_ENCODING_STOPONERROR flags (or the absence of the TCL_ENCODING_NOCOMPLAIN flag in Tcl 9.0) whould still be handled wrong in the shift-jis encoding: This flag was present, but not accessible on the script level, so it was never tested well.

For TIP #633, I would suggest to introduce a new flag TCL_ENCODING_STRICT, which has the value 0x44 in Tcl 8.7 (so both TCL_ENCODING_STOPONERROR and TCL_ENCODING_NOCOMPLAIN are set, which is - for now - an impossible situation). When using "-strictencoding 1", those flags will be given to the encoder, which takes care of the rest.

Anyway, still work to do .....


oehhar added on 2022-09-12 12:18:55:

Thank you!

The read tests are now repeated with shiftjis encoding. So, we have for utf-8 and shift-jis:

  • invalid sequence read: io-75.1 and io-75.4
  • EOF before sequence completion: io-75.3 and io-75.5

The behaviour is the same for both encodings:

  • On TCL8.6 and TCL 8.7, the wrong bytes are outputted as verbatim data.
  • On TCL 9.0, the output is truncated at the first error position.

Take care, Harald


oehhar added on 2022-09-12 09:59:04:

Jan, thank you for the action, great !

Some comments:

Write error position

I am not sure, which way would be best:

  • work like a filter, e.g. all characters but the error characters are sent (new implementation).
  • stop on error (old implementation).

Example:

fconfigure $h -encoding iso8859-1
puts $h "A\u2022B"

With the filter implementation, "AB" is sent, and an error is thrown. With the stop implementation, "A" is sent, and an errpr is thrown.

IMHO, the error implementation would be better on a user level. If there is an error, I can introspect what happened and react on the channel, like sending "&h2022" and then resuming, e.g. sending "B". With the filter implementation, it is to late to act. The "B" is already sent.

I think, the difference is small and not important. The only point is to document well, what is happenning.

And I don't care, only get 9.0 out is the aim.

UTF-8 special

Ok, if UTF-8 is special, then we would need special test cases for that. So, I will add test cases for UTF-8 and another multibyte like shift-jis. I have to dig-in to find one which may have invalid sequences.

I find it quite funny, that the ISO8859-1 folks get a work-around/compatibility layer, but all other system encodings don't get that. My system encoding is cp1250, which is similar to ISO8859-1. But all scripts only use ASCII, even comments. This is quite a challenge and quite critical. I remember, when we changed the encoding of msg files from system encoding to UTF-8 with TCL8.6. It failed at many places.

I personally would prefer to throw an error for source to on invalid utf-8. If people want to migrate to TCL9 and the current system encoding is "shift-jis", they get garbage result and no error on the source command, as everything is accepted. I would prefer the error message.

TIP633 scope

The only scope of TIP633 is to get the TCL 8.6 behaviour optionally also available on TCL9.0. If UTF-8 accepts ISO8859-1, then this is ok and not covered by the TIP.

Thank you for all, Harald


jan.nijtmans added on 2022-09-11 22:07:03:

The bug should now be fixed on both core-8-branch (only visible when compiling with -DTCL_NO_DEPRECATED) and trunk.

Still, something needs to be said about the utf-8 encoding. The utf-8 encoding is special in that it can also read iso8859-1 and cp1252 without complaining on 'invalid utf-8'. For details, see TIP #587. Since in Tcl 9.0, the default encoding on Windows changed from cp1252 to utf-8, I'm afraid that there are many Tcl scripts around still encoded in either iso8859-1 or cp1252. If Tcl would use -strictencoding 1 by default in 9.0, those scripts will stop working in Tcl 9, all users would be forced to convert their scripts to proper utf-8 before they can do anything. I'm not prepared to take that risk. utf-8 is still not sufficiently common on Windows.

However, TIP #633 can introduce a -strictencoding option, which is only functional for utf-8 (all other encodings are strict already). So, people who want to be strict can enable it, without causing a burden for other people.

Anyway, since this bug is fixed now, it's out-of-the-way for TIP #633. And thank you for providing the testcases and your 'fix', which pointed me to the right direction.

Hope this helps. Jan Nijtmans


jan.nijtmans added on 2022-09-11 21:00:57:

I'm sorry, but checkin [e67a247dd4] still doesn't fulfull my expectation of what should happen. I adapted thestcase io-75.5, testing for both the written output (which was correct before the fix) and the exception thrown. With this fix, the expected output is missing.

This shows that channel handling is a bit more complicated than expected. For example, what if the 7th byte of a sequence is in error, but after writing 4 bytes to the channel another error occurs? Then - I would say - the error that occurs first in the byte stream (not the error which is discovered first) should win.

So, after the encoding error is detected, the channel should continue to operate as normal, only it shouldn't read in new bytes, just write out the bytes as far as they were encoded. As soon as all bytes are handled there are two possibilities: 1) all bytes were written correctly, then the exception should be thrown after all. 2) the write resulted in some error, than that exception should be thrown ....

Hope this helps.


oehhar added on 2022-09-11 13:47:26:

Checkin [e67a247dd4] cures the invalid write case and makes test io-75.5 as a real test.

The invalid read items are still open.


oehhar added on 2022-09-11 12:56:51:

Some debugging.

The call to tclio.c/4348:

	result = Tcl_UtfToExternal(NULL, encoding, src, srcLimit,
		statePtr->outputEncodingFlags,
		&statePtr->outputEncodingState, dst,
		dstLen + BUFFER_PADDING, &srcRead, &dstWrote, NULL);

In TCL 9.0 with outputEncodingFlags=0 results in the result = -3 (TCL_CONVERT_UNKNOWN).

This error is ignored.

I think, we would need an error return list after the call like:

if (result == TCL_CONVERT_UNKNOWN) {
    Tcl_SetErrno(EILSEQ);
    return -1;
}

Sunday session over, back on monday.

Take care, Harald


oehhar added on 2022-09-11 12:24:35:

The error below is a side-effect of incomplete UTF-8 sequence check.

It gets out at tclio.c Line 4365:

	if ((result != TCL_OK) && (srcRead + dstWrote == 0)) {
	    /*
	     * We're reading from invalid/incomplete UTF-8.
	     */

	    if (total == 0) {
		Tcl_SetErrno(EILSEQ);
		return -1;
	    }
	    break;
	}

This is not the intended place...

Take care, Harald


oehhar added on 2022-09-11 12:17:04:

Hi Jan,

the error shows up, if the issue character is the first character:

% set f [open c:/test/test.txt w]
fileb6a3c0
% fconfigure $f -encoding iso8859-1 -strictencoding 1
% puts $f "\u2022"
error writing "fileb6a3c0": illegal byte sequence
% puts $f "A\u2022"

If there is the "A" in front, it does not show up.

Take care, Harald


oehhar added on 2022-09-11 08:08:11:

Thank you. There are now 3 test cases:

  • invalid utf-8 sequence read
  • write unrepresentable character
  • incomplete utf-8 sequence read (eof before end)

I personally don't understand, why the tests are depreciated in 8.7.

I plan with TIP633 to:

  • add "fconfigure -strictencoding 0" to 8.7 and the tests
  • duplicate the tests in 9.0 (where they should succeed then)
  • add "fconfigure -strictencoding 1" to 9.0 and move the current tests in 9.0 75.1 to 75.3 to new tests.

Is this a plan ?

Thanks, Harald


jan.nijtmans added on 2022-09-10 12:16:48:

Thanks you for the testcases. I'll have a further look


oehhar added on 2022-09-09 14:07:05:

Two new tests added to prepare TIP 633:

  • io-75.1: read an invalid utf-8 sequence from a file
  • io-75.2: write an unrepresentable character to a file

Those tests are in:

  • TCL 8.6: [2221b9b6e3]
  • TCL 8.7: [7ab32e9bf1]
  • TCL 9.0: [67ffff5b7a]

In TCL 8.6 and 8.7, both operation give results and succed. This works currently well.

On TCL 9.0, both operations should fail. Currently, they don't. In consequence, the TCL 9.0 tests are marked as knownBug.


oehhar added on 2022-08-31 11:44:21:

Ok, thank you for the information. So, TIP633 makes sense, if an error is expected.

I also tried the other way around, e.g. receiving an invalid sequence. My standard example is to receive an incomplete UTF-8 sequence.

Here is my test script for two interpreters:

set serverhandle [socket -server connect 30000]
proc connect {handle args} {
	if {[catch {
		fconfigure $handle -encoding utf-8 -buffering none
		puts "Read data: '[gets $handle]'"
		close $handle
	} res]} {
		puts stderr "Read error: $res"
	}
}
vwait forever
set clienthandle [socket localhost 30000]
fconfigure $clienthandle -encoding binary
puts -nonewline $clienthandle\
		[string range [encoding convertto utf-8 "AÄ"] 0 end-1]
close $clienthandle

Here, 8.6.12 and 9.0 both return:

<verbatm> Read data: 'AÃ'

In case of 9.0, this should be "A" and an error, which is not raised. The 8.6 case is "A" and then the incomplete sequence as raw byte.

So, this looks boguous too.

-----

I also wondered, how those changes are covered by TIP 601, as I don't find anything about it in TIP 601. It is an indirect consequence probably.


jan.nijtmans added on 2022-08-31 10:33:19:

Indeed, this is not expected. I can reproduce the problem.

The "puts" should output 'A' and then throw an exception. It stops correctly, but the exception is what's missing. Probably a check missing in the C-code for this error-condition....