Tcl Source Code

View Ticket
Ticket UUID: 85ddd247b637bcb25990fc55536e0ce3f9e754b2
Title: Unable to set channel to binary encoding
Type: Bug
Version: 9.0b2
Submitter: cjmcdonald
Created on: 2024-06-19 16:56:40
Subsystem: 25. Channel System
Assigned To: jan.nijtmans
Priority: 5 Medium
Severity: Important
Status: Open
Last Modified: 2024-06-23 22:24:28
Resolution: None
Closed By: nobody
Closed on: 2024-06-21 07:26:48
Description:

Tcl 9.0b2

Windows 10 22H2

Default build of Tcl 9.0b2:

    nmake -f makefile.vc INSTALLDIR=C:\Tcl9
    nmake -f makefile.vc INSTALLDIR=C:\Tcl9 install

When I try to set the encoding of a file channel to binary, it is set to iso8859-1 instead:

    C:\Tcl9\bin\tclsh90.exe
    % encoding system
    utf-8
    % set f [open temp.txt w]
    file268360e3060
    % fconfigure $f
    -blocking 1 -buffering full -buffersize 4096 -encoding utf-8 -eofchar {} -profile strict -translation crlf
    % fconfigure $f -encoding binary
    % fconfigure $f -encoding
    iso8859-1
    % close $f
    %
    % set f [open temp.txt wb]
    file268360e3060
    % fconfigure $f -encoding
    iso8859-1
    % close $f
    %

I encountered the same issue with a tcludp channel in Tcl 9.0b2.

A possibly related issue occurs running the same build on a Windows Server 2019 system. It now incorrectly reports the system encoding as iso8859-1.

    C:\Tcl9\bin\tclsh90.exe
    % encoding system
    iso8859-1

With Tcl 8.* it was correctly reported as cp1252. I'd guess that Windows Server 2019 may not support the <activeCodePage> UTF-8 setting which has been added to the tclsh manifest.

User Comments:

pooryorick added on 2024-06-23 22:24:28:

To restate, "binary" data is nothing more than Unicode text containing only characters defined in the ASCII and Latin-1 code charts.


pooryorick added on 2024-06-23 22:16:49:

There is no "binary" encoding. At the script level it was always just iso8859-1, and always will be. This is not an implementation detail leaked to the script level, just the fact that at the script level there is no "binary" mode, and that "binary" was always just an ill-advised synonym for iso8859-1. At the script level there is only Unicode text, and if one wishes to manipulate "binary" data, one must in principle translate each incoming byte into the corresponding Unicode code point (even if internal optimizations like bytearray elide this step). If each character in the text has a code point of 255 or less, it is amenable to interpretation as bytewise numerical values (binary).

In Tcl 8.6, configuring the translation of a channel as binary and then checking configuration yields a translation of "lf", not "binary", and this has always worked just fine. The same should be true for encoding: Programmers should understand that "binary" means iso8859-1 (assuming binary isn't removed as an encoding).

If anyone is writing code that is relying on whether the encoding for a channel is "binary", their code is probably buggy, and it probably would be less buggy if Tcl had never provided "binary" as an encoding option in the first place. Binary data can come in through just about any encoding. It doesn't make much sense to condition code on whether the channel encoding is "binary".

There is sufficient introspection available already. There is no need for a [chan isbinary], and it would be a bad idea to provide it. It generally makes more sense not to introspect the encoding, and instead to just set the encoding to what it is expected to be. I would like to see the use case for introspecting encoding.

The best course of action would be to eliminate "binary" as an encoding, since it never really was one, and leave "binary" only as an argument to -translation.


griffin added on 2024-06-23 21:47:29:

What has been done is that the implementation detail that "binary" is identical to "iso8859-1" is exposed to the script level. This is a detail that should not be exposed. The binary encoding has a distinct characteristic that separates it from all other encodings. It could be that, in the future, the internal implementation is changed again. Then what do you do?

"binary" needs to be supported, honoring the behaviors currently in prior release versions.


jan.nijtmans added on 2024-06-23 20:57:32:

How about adding a [chan isbinary] command ([41900f4c4fcae34a]), which - for the first time - gives the right answer?


jan.nijtmans added on 2024-06-23 20:56:30:

So let's go back to the 'introspection' part. As I already pointed out, this ticket is really about being able to introspect whether a channel is binary or not. Checking whether "chan configure -encoding" reports "binary" does not answer that: the channel could still use -translation or -eofchar, which means that the channel is not binary.

How about adding a [chan isbinary] command ([41900f4c4fcae34a]), which - for the first time - gives the right answer?
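
The point about -translation and -eofchar can be illustrated at script level. The following is only a sketch: `looksBinary` is a hypothetical helper name, not the proposed [chan isbinary] command, and it assumes that a channel counts as binary when its encoding is reported as "binary" or "iso8859-1", every translation mode is "lf", and no end-of-file character is set.

```tcl
# Hypothetical helper (not part of Tcl): reports 1 when the channel passes
# bytes through unchanged under the assumptions stated above.
proc looksBinary {chan} {
    # Query all options at once; the result is a list of option/value pairs.
    set opts [fconfigure $chan]
    # Tcl 8.6 reports "binary", Tcl 9.0b2 reports "iso8859-1".
    if {[dict get $opts -encoding] ni {binary iso8859-1}} {
        return 0
    }
    # Read-write channels may report one mode per direction.
    foreach mode [dict get $opts -translation] {
        if {$mode ne "lf"} {
            return 0
        }
    }
    # Any non-empty end-of-file character means bytes can be interpreted.
    foreach ch [dict get $opts -eofchar] {
        if {$ch ne ""} {
            return 0
        }
    }
    return 1
}

set f [file tempfile path]
fconfigure $f -translation binary
set rawResult [looksBinary $f]
fconfigure $f -translation crlf
set crlfResult [looksBinary $f]
close $f
file delete $path
puts "after -translation binary: $rawResult"
puts "after -translation crlf:   $crlfResult"
```

Because the helper accepts both reported encoding names, it gives the same answer on Tcl 8.6 and on Tcl 9.0b2.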


pooryorick added on 2024-06-23 08:33:05:

typo correction: "if you want to, you can treat it as a list"


pooryorick added on 2024-06-22 22:04:53:

For those struggling to understand this concept, consider a similar situation with lists: Every list is a string, i.e. Unicode text. Internally, for performance's sake, a structured representation of a list may be attached to the Tcl_Obj for that value, but that's just an optimization. A list is Unicode text, and if you want to, you can treat it as a list. The same is true with binary data in Tcl: Binary data is Unicode text, and if you want to, you can treat the characters in that text as values representing something other than characters. That doesn't change the fact that the value is Unicode text. In Tcl there is no bifurcation between text and binary values. It's all text. It's just that some text can be interpreted as binary if that's what suits the application.


pooryorick added on 2024-06-22 21:53:38:

Every value is a string. It's that simple. bytearray is strictly an internal optimization. In Tcl, binary data is simply Unicode text. Don't let the internal optimizations fool you.
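
The script-level view described here can be observed directly. This is a small sketch with arbitrarily chosen byte values, not a statement about the internal representation: a value built with [binary format] can be split into characters whose code points equal the byte values, and decoding it as iso8859-1 yields an equal string.

```tcl
# Four arbitrary bytes: 0x41, 0xFF, 0x7F, 0x80.
set data [binary format H* "41ff7f80"]

# Treat the value as text: collect the code point of each character.
set codes {}
foreach ch [split $data ""] {
    lappend codes [scan $ch %c]
}

# Decode the same bytes as iso8859-1 and compare with the original value.
set sameAsLatin1 [expr {[encoding convertfrom iso8859-1 $data] eq $data}]

puts "length: [string length $data]"
puts "code points: $codes"
puts "equal to iso8859-1 decoding: $sameAsLatin1"
```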


griffin added on 2024-06-22 12:53:07:

I find this discussion strange. I agree with sebres here, "binary" means raw data, which in Tcl terms is a bytearray. There is no mapping for the sequence of bytes to any string or glyph representation.

If the tcl interpreter is forced to print the bytearray (i.e. convert the bytes to a glyph), then iso-8859-1 is the defined mapping that tcl (tk?) should use, even if the bytes aren't actually meant to be used as printable glyphs. (Note: even iso-8859-1 is not ideal because not all byte values have a defined glyph) In reality, the receiving device will ultimately decide what glyph mapping is used.

The term "binary" has always meant "raw data, don't touch or interpret please!". So I disagree with Nathan's interpretation wrt iso-8859-1. Binary means binary, and the channel configuration should reflect that on introspection, even if, internally, the treatment of the data is identical between iso-whatever (iso-8859-2, iso-8859-7,...) and binary. This is important information the coder needs.


jan.nijtmans added on 2024-06-22 12:31:47:
> Also, I don't understand how it can be compared with iso8859-1 (which, strictly by the standard, definitely lacks the chars 0x00-0x1F and 0x7F-0x9F, so I guess it needs to fail with profile strict or replace with profile replace, and even if it doesn't right now, that may change later).

Allow me to correct this. The ranges 0x00-0x1F and 0x7F-0x9F in iso8859-1 are control codes, definitely a valid range. Therefore "iso8859-1" is one of the few encodings with a unique property: Reading a file with encoding set to "iso8859-1", there are no invalid bytes, so it doesn't matter which -profile is set: No encoding errors are thrown during file reading, whether -profile is "strict", "tcl8" or "replace", or whatever future value. That should never change.

In Tcl 8.x, "binary" and "iso8859-1" are not 100% aliases: "binary" has additional optimizations when using byte arrays. But at script level, those two behave the same. "binary" has some shortcuts which makes things faster, which "iso8859-1" doesn't have. In Tcl 9.0, "iso8859-1" inherits the same optimizations, which means that "binary" now is a real alias to "iso8859-1".

Hope this helps
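
The property claimed above for iso8859-1, that no byte is invalid, can be checked with a short round trip. This is only a sketch using [encoding convertfrom] and [encoding convertto] on all 256 byte values; it does not exercise the channel layer or the individual -profile settings.

```tcl
# Build a byte sequence containing every value from 0x00 to 0xFF.
set all ""
for {set i 0} {$i < 256} {incr i} {
    append all [binary format c $i]
}

# Decode as iso8859-1, then encode again.
set text [encoding convertfrom iso8859-1 $all]
set back [encoding convertto iso8859-1 $text]

set decodedLength [string length $text]
set roundTripOk [expr {$back eq $all}]
puts "decoded length: $decodedLength"
puts "round trip intact: $roundTripOk"
```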

sebres added on 2024-06-22 11:16:59:

> The documentation should explain that in Tcl "binary" data is text

It is not! It is and remains a bytearray; at least as long as one is working only with commands that support binary data (bytearray), it would never even shimmer to something else (and shall not do that).

EIAS may lead people to make the same mistake you did, but it doesn't change the reality: bytearrays remain special things in Tcl. Bytearrays were never text and never will be. The fact that bytes are also representable as a string (chars) in Tcl (without explicit conversion or cast, due to EIAS) shall not be explained, as you suggest, as meaning that it is text in Tcl, because it is not.

Also, I don't understand how it can be compared with iso8859-1 (which, strictly by the standard, definitely lacks the chars 0x00-0x1F and 0x7F-0x9F, so I guess it needs to fail with profile strict or replace with profile replace, and even if it doesn't right now, that may change later). Nor do I understand how arguments like this help with the issue. By the way, unicode is also not a real encoding, because, similar to binary, it simply forces Tcl to handle the channel as a unicode string (objects with unicode representation), yet that doesn't bother anyone at all.


pooryorick added on 2024-06-22 10:13:39:

Also, if this new behaviour remains, "won't fix" is the wrong resolution, because it means "this is a bug we're keeping". "invalid" is the right resolution because it means, "this is not a bug".


pooryorick added on 2024-06-22 09:23:52:

It's important that people understand what "binary" really is in this text-oriented language we call Tcl. Sugarcoating names with the term "binary" leads to more subtle confusion among Tcl programmers. The documentation should explain that in Tcl "binary" data *is* text, and that working with binary data means constraining oneself to using only the first 256 Unicode characters. If anything, "binary" should be removed as an encoding (but should remain for -translation). That would have the benefit that people would no longer make the mistake of using "-encoding binary" when they meant "-translation binary", which is a pretty common silent-corruption-producing footgun.
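
The footgun mentioned above can be reproduced with a small experiment. This is a sketch: `writtenSize` is a hypothetical helper, and "-translation crlf" is set explicitly to stand in for the Windows default seen in the ticket description, so that the effect is visible on any platform.

```tcl
# Hypothetical helper: write $data to a temporary file with the given
# channel options and return the resulting file size in bytes.
proc writtenSize {data args} {
    set f [file tempfile path]
    fconfigure $f {*}$args
    puts -nonewline $f $data
    close $f
    set size [file size $path]
    file delete $path
    return $size
}

# Two bytes that happen to be line feeds (0x0A 0x0A).
set data [binary format H* "0a0a"]

# Only the encoding is changed; the crlf translation still rewrites each 0x0A.
set textMode [writtenSize $data -translation crlf -encoding binary]

# -translation binary also resets the translation and the eof character.
set rawMode [writtenSize $data -translation binary]

puts "with -encoding binary only: $textMode bytes"
puts "with -translation binary:   $rawMode bytes"
```

With only the encoding changed, each line feed is written as a carriage return plus line feed, so the file is larger than the data that was passed in.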

Programmers should take the time to read the docs and understand why iso-8859-1 is used to read and write text that can be interpreted as "binary" data.


jan.nijtmans added on 2024-06-21 08:03:52:

I see the discussion going into two directions. One is about the introspection, which is this ticket. "Invalid" is the wrong resolution here (sorry, Nathan). We need to decide whether we decide to roll-back the behavior to be the same as in 8.6 or keep it the way it is now.

I'll discuss this with the other TCT members, and whether "Fixed" or "Won't Fix" is the best here. My personal preference is "Won't fix", but I don't have the power alone to make that decision. In case of "Won't fix" the documentation needs to be improved, that's - at least - my conclusion from this ticket. In case of "Fixed" it shouldn't be just a reversion of Nathan's work: generally, I like it, because it gives "iso8859-1" the same optimization as "binary" had!

Stay tuned. It may take a while; this is not high priority.

Keeping this ticket open, until a decision has been made. At least, documentation should be improved.


cjmcdonald added on 2024-06-21 07:26:48:
If the user sets an encoding of binary (whether using -encoding or -translation), then for fconfigure to return something different is unexpected and surprising:

It's undocumented in the man pages.

It's not at all obvious to a user that encoding binary and encoding iso8859-1 are the same thing. As a user, setting encoding binary means that I want byte values to be transferred unchanged. Setting character set encoding iso8859-1 means that I want character values to be transformed between that encoding and Tcl's internal representation. If they happen to be implemented in the same way then I'd expect Tcl to manage that internally, rather than exposing it to the user.

Changing the value which fconfigure returns causes needless problems to scripts or GUIs which perform introspection of channel settings.

pooryorick added on 2024-06-20 22:23:58:

As Jan said, this is correct behaviour. "binary" is an alias for "iso8859-1".


jan.nijtmans added on 2024-06-19 19:47:05:

Actually, since "-encoding binary" is an alias for "-encoding iso8859-1", this new behavior is not strange at all. You most likely want to use "-translation binary". That's also the same as "-encoding iso8859-1", but also sets -translation to lf and -eofchar to {}. See: https://core.tcl-lang.org/tcl/info/f375dcda79?ln=255-265
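
To see what "-translation binary" sets, the three options can be queried after configuring a channel. This is only a sketch on a temporary file. The reported encoding name depends on the Tcl version ("iso8859-1" in 9.0b2 as shown in this ticket, "binary" in 8.x), and read-write channels may report one translation value per direction.

```tcl
set f [file tempfile path]

# One option that changes encoding, translation and eof character together.
fconfigure $f -translation binary

# Collect the resulting settings for inspection.
set settings [dict create]
foreach opt {-encoding -translation -eofchar} {
    dict set settings $opt [fconfigure $f $opt]
    puts "$opt: [dict get $settings $opt]"
}

close $f
file delete $path
```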


jan.nijtmans added on 2024-06-19 19:35:02:

This change - most likely - comes from commit [b66d50b4d45f6f36]. Ticket: [fa3d9fd818fa0072]

Assigning to Nathan Coulter.

The second issue you mention, the system encoding change on Windows Server 2019, is totally unrelated. It would be better to file another ticket for that.