|Title:||TIP 573 discussion|
|Submitter:||dgp||Created on:||2020-04-28 21:46:10|
|Subsystem:||- New Builtin Commands||Assigned To:||nobody|
|Status:||Open||Last Modified:||2020-05-18 14:57:48|
TIP 573 proposes rejection of unpaired surrogates in strings. This ticket was opened to collect comments and discussion on the idea.
jan.nijtmans added on 2020-05-18 14:57:48:
Hm. I read that on Linux systems, the names of filesystem paths are managed as char arrays, which can be arbitrary bytes...
So, on Linux, with the system encoding set to UTF-8, we can have two different files in the same directory: one with the byte sequence "\xC2\xA2", another with the byte sequence "\xA2". Both will be decoded as "¢" by Tcl. It's exactly the same type of problem.
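The collision described above can be modelled in a few lines of Python. This is only a sketch of the behaviour the comment describes, not Tcl's actual decoder: it assumes that a byte which is not valid UTF-8 is passed through as the Latin-1 character with the same value. The names `lenient_decode` and `latin1_fallback` are hypothetical.

```python
import codecs

def _latin1_fallback(exc):
    # Assumption: an invalid byte is mapped to the Latin-1 character
    # with the same numeric value, as the comment says Tcl does.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return bad.decode("latin-1"), exc.end

codecs.register_error("latin1_fallback", _latin1_fallback)

def lenient_decode(name: bytes) -> str:
    """Decode a file name as UTF-8, falling back to Latin-1 per bad byte."""
    return name.decode("utf-8", errors="latin1_fallback")

valid_name = b"\xc2\xa2"   # well-formed UTF-8 for U+00A2
invalid_name = b"\xa2"     # a lone continuation byte, invalid UTF-8

# Two distinct on-disk names become indistinguishable after decoding.
same_text = lenient_decode(valid_name) == lenient_decode(invalid_name)
```

Under that assumption, two different byte sequences map to the same string, so the original file name cannot be recovered from the decoded value.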
jan.nijtmans added on 2020-05-18 14:34:20:
> I read that on Windows systems, the names of filesystem paths
> are managed as WCHAR arrays, which can be any arbitrary sequence
> of 16-bit values
I don't think we need to worry about that much. See the second note here: https://docs.microsoft.com/en-us/windows/win32/intl/surrogates-and-supplementary-characters
dgp added on 2020-05-06 20:07:12:
Today I encountered what might be another use case for making sure we preserve the ability to have values that are arbitrary UCS-2 sequences. I read that on Windows systems, the names of filesystem paths are managed as WCHAR arrays, which can be any arbitrary sequence of 16-bit values. We may need UCS-2 arrays or the equivalent to handle those without loss or corruption.
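As a rough illustration of the concern, the following Python sketch treats a path name as raw 16-bit units that include an unpaired high surrogate. The sample name is made up. It uses Python's standard `surrogatepass` error handler as a stand-in for "a representation that accepts arbitrary 16-bit values"; it says nothing about how Tcl would store such a value.

```python
# Hypothetical name: "a", the lone high surrogate 0xD800, "b" (UTF-16LE units).
raw_units = b"a\x00" + b"\x00\xd8" + b"b\x00"

def strict_decode_ok(units: bytes) -> bool:
    """True if the units form well-formed UTF-16."""
    try:
        units.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False

# A strict decoder rejects the name outright.
accepted_strictly = strict_decode_ok(raw_units)

# A representation that tolerates lone surrogates keeps every unit.
as_text = raw_units.decode("utf-16-le", errors="surrogatepass")
round_trip = as_text.encode("utf-16-le", errors="surrogatepass")
```

With the tolerant handler the name survives a decode and encode round trip unchanged, which is the "without loss or corruption" property at stake.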
dgp added on 2020-04-30 15:37:31:
The other thought I have is that if we are aiming for increased Unicode conformance, we have other examples to address, I think at a higher priority. If we are considering rejecting the byte sequence 0xED 0xA0 0x80 because it is invalid UTF-8 (it encodes a surrogate), I think we have to consider how we react to all the other byte sequences that are also invalid in UTF-8: \xC1, \xF5, etc. Unicode recommendations would have us treat all of these as the replacement character U+FFFD (and then we get into the disputes about how many :) ). I'd like us to demand valid Modified UTF-8, at least in Tcl 9. At least I think so, unless exploration of the idea reveals a strong reason otherwise.
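To make the comparison concrete, here is a small Python sketch that feeds the byte sequences mentioned above to a strict UTF-8 decoder and to one that substitutes U+FFFD. It shows how Python's decoder treats them, which is one possible policy and not a statement about Tcl. The helper names are hypothetical, and the number of U+FFFD characters produced per sequence is deliberately not asserted, since that is exactly the disputed part.

```python
SAMPLES = {
    "encoded surrogate (ED A0 80)": b"\xed\xa0\x80",
    "byte never valid in UTF-8 (C1)": b"\xc1",
    "byte never valid in UTF-8 (F5)": b"\xf5",
}

def is_strict_utf8(data: bytes) -> bool:
    """True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def with_replacement(data: bytes) -> str:
    """Decode, substituting U+FFFD for whatever is ill-formed."""
    return data.decode("utf-8", errors="replace")

for label, data in SAMPLES.items():
    print(label, is_strict_utf8(data), len(with_replacement(data)))
```

All three samples are rejected by the strict decoder, so a rule aimed only at surrogates would cover just one of the ill-formed cases.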
dgp added on 2020-04-30 15:27:56:
The trickiest issue here is that Tcl strings have two uses. First, they are a way to store and process text, and for that purpose increased Unicode conformance is a good way to bring the capabilities of Tcl programming into agreement with the expectations of text programming in other languages. However, Tcl strings are also the common foundation of all values in the language: "Everything is a string". If we remove some of the alphabet we can use to make up those strings, we are removing some of the expressible values in that representation set, which can lead to trouble.
The same issue came up in the 8.0 -> 8.1 transition. In 8.0 "Everything is a string" meant "Everything is representable as a sequence of bytes", and in order to support all values, Tcl 8.0 was written to support strings covering that entire set of sequences of any byte value. When 8.1 came out, the new support for international character sets brought with it a new rule that the "string" values were conceptually "Any sequence of UCS-2 characters", and when looking at (char *) arguments something similar to UTF-8 would be used to encode that concept. This change meant that at least for some commands, script writers had to become aware of the distinction between "bytes" and "chars". Using the C routines properly demanded even more awareness. If some user of Tcl 8.0 was accustomed to passing the bytes "\xC2\x80" to Tcl and having them reliably treated and preserved as a sequence of two things, Tcl 8.1 broke that contract, because large parts of Tcl now saw and treated that as U+0080. To make up for it, Tcl 8.1 provided bytearrays, so that anyone needing to send arbitrary byte sequences into Tcl free from corruption had a way to do it.
In parallel with that, I think if we revise our string definition to exclude the UCS-2 characters that are surrogates in UTF-16, we likewise need some new facility so that any Tcl users who need to work with UCS-2 arrays can continue to do so.
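The "bytes" versus "chars" distinction from the 8.0 -> 8.1 example can be shown with the same two bytes. This Python sketch is only an analogy: `bytes` stands in for a Tcl bytearray and `str` for a Tcl string, and none of it reflects Tcl internals.

```python
raw = b"\xc2\x80"

# Viewed as bytes: a sequence of two things, preserved exactly.
as_bytes = list(raw)            # [0xC2, 0x80]

# Viewed as UTF-8 encoded text: a single character, U+0080.
as_text = raw.decode("utf-8")

byte_count = len(as_bytes)
char_count = len(as_text)
```

The same input has length two under one interpretation and length one under the other, which is why a separate byte-oriented type was needed once strings became sequences of characters.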
I think there's at least some reason to believe UCS-2 arrays are a concept Tcl programmers need. If I imagine reading from a channel encoded in utf-16, there can be a point where I've read the high surrogate but not yet the low surrogate that follows it. In that moment, I need to store in the variable holding "what I've read so far from the channel" a value that includes an unpaired surrogate, just as I can currently use a bytearray to hold an incomplete UTF-8 sequence partially read from a binary channel. This probably means expanding the set of concepts for Tcl programming to include bytes, code units, and characters. I think the historical example of 8.0 -> 8.1 establishes that such a change is not absolutely forbidden for 8.7, but it is a change of remarkable scale and potential for disruption. If we contemplate it, I think we are obligated to supply clear migration support and high-quality advice on how to navigate it. Does that make sense?
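The split-read scenario can be sketched with Python's incremental UTF-16 decoder. Note that this shows a different design from the one discussed above: the pending high surrogate is held inside the decoder's internal state, not in a user-visible string value. The chunk boundaries are chosen by hand for illustration.

```python
import codecs

# U+1F600 is the surrogate pair D83D DE00; in UTF-16LE that is 3D D8 00 DE.
first_chunk = b"\x3d\xd8"    # high surrogate only
second_chunk = b"\x00\xde"   # the low surrogate arrives later

decoder = codecs.getincrementaldecoder("utf-16-le")()

# After the first read, the decoder holds the unpaired unit and emits nothing.
text_so_far = decoder.decode(first_chunk)

# Once the low surrogate arrives, the complete character is produced.
completed = decoder.decode(second_chunk)
```

A script that instead wants the "what I've read so far" value in a variable between the two reads needs some type that can hold the lone code unit, which is the gap the comment points at.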
dgp added on 2020-04-28 21:58:01:
This points the way toward improved Unicode conformance, which is overall a good thing. I have several thoughts which aren't terribly well organized yet. I'll try to get my thinking straight and clear before I spew too much here. Most of the issues are not about where to go, but how and when and how completely to get there.
jan.nijtmans added on 2020-04-28 21:49:55:
One thing I'm interested to know: Would this be OK for 8.7? I think it should be, even though it introduces a potential incompatibility: anyone using unpaired surrogates should deal with their own consequences. But if there is a lot of objection, it could be scheduled for 9.0 as well.