Tcl Source Code

Ticket UUID: 5dabd088eecefd81cf86cf213a2e3e7f25b1329b
Title: TIP 573 discussion
Type: RFE Version:
Submitter: dgp Created on: 2020-04-28 21:46:10
Subsystem: - New Builtin Commands Assigned To: jan.nijtmans
Priority: 5 Medium Severity: Important
Status: Closed Last Modified: 2022-04-01 09:08:26
Resolution: Out of Date Closed By: jan.nijtmans
    Closed on: 2022-04-01 09:08:26
Description:
TIP 573 proposes rejection of unpaired surrogates in strings.

This ticket is opened to hold comments and discussion on the idea.
User Comments: jan.nijtmans added on 2022-04-01 09:08:26: (text/x-fossil-wiki)
Since TIP #573 is withdrawn, closing.

jan.nijtmans added on 2020-05-18 14:57:48: (text/x-fossil-wiki)
Hm. I read that on Linux systems, the names of filesystem paths are managed as char arrays, which can be arbitrary bytes...

So, on Linux, having set the system encoding to UTF-8, we can have two different files in the same directory: one with the byte sequence "\xC2\xA2", another with the byte sequence "\xA2". Both will be decoded as "¢" by Tcl. It's exactly the same type of problem.
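The collision described here can be sketched in Python. This is only an illustration, not Tcl's decoder: the `lenient_decode` helper and the `latin1-fallback` handler name are hypothetical, and the sketch assumes a decoder that maps any byte that is not valid UTF-8 to the code point with the same value.

```python
import codecs

def _latin1_fallback(err):
    # Hypothetical handler: treat one undecodable byte as the code point
    # with the same numeric value (Latin-1), then resume after it.
    if isinstance(err, UnicodeDecodeError):
        bad = err.object[err.start:err.start + 1]
        return bad.decode("latin-1"), err.start + 1
    raise err

codecs.register_error("latin1-fallback", _latin1_fallback)

def lenient_decode(raw):
    """Decode as UTF-8, falling back to Latin-1 for invalid bytes."""
    return raw.decode("utf-8", errors="latin1-fallback")

name_a = lenient_decode(b"\xC2\xA2")  # well-formed UTF-8 for U+00A2
name_b = lenient_decode(b"\xA2")      # lone continuation byte
print(name_a == name_b)               # two distinct files, one decoded name
```

For comparison, Python itself avoids this collision for filenames by decoding undecodable bytes with the `surrogateescape` error handler, which maps each such byte to a lone low surrogate. That is one example of a runtime relying on unpaired surrogates internally.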

jan.nijtmans added on 2020-05-18 14:34:20: (text/x-fossil-wiki)
> I read that on Windows systems, the names of filesystem paths
> are managed as WCHAR arrays, which can be any arbitrary sequence
> of 16-bit values

I don't think we need to worry about that much. See second note here:
[https://docs.microsoft.com/en-us/windows/win32/intl/surrogates-and-supplementary-characters]

dgp added on 2020-05-06 20:07:12:
Today I encountered what might be another use case to be sure we preserve
the ability to have values that are arbitrary UCS-2 sequences.  I read that
on Windows systems, the names of filesystem paths are managed as WCHAR
arrays, which can be any arbitrary sequence of 16-bit values.  We may need
UCS-2 arrays or the equivalent to handle those without loss or corruption.

dgp added on 2020-04-30 15:37:31:
The other thought I have is that if we are aiming for
increased Unicode conformance, we have other
examples to address, I think at a higher priority.

If we are considering rejecting the byte sequence

0xED 0xA0 0x80

because it is invalid UTF-8 because it represents
a surrogate, I think we have to consider how we react
to all the other byte sequences that are also invalid
in UTF-8.

\xC1
\xF5

etc.

Unicode recommendations would have us deal
with all of these as replacement character U+FFFD
(and then into the disputes about how many :) )
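As a rough illustration of the byte sequences listed above, here is a Python sketch. It shows Python's UTF-8 decoder, not Tcl's; the helper names `is_valid_utf8` and `with_replacement` are made up for this example.

```python
# Byte sequences named in the comment above; none is valid UTF-8.
SAMPLES = {
    "encoded surrogate U+D800": b"\xED\xA0\x80",
    "byte C1": b"\xC1",
    "byte F5": b"\xF5",
}

def is_valid_utf8(raw):
    """Return True only if raw decodes under strict UTF-8 rules."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def with_replacement(raw):
    """Decode, substituting U+FFFD for whatever cannot be decoded."""
    return raw.decode("utf-8", errors="replace")

for name, raw in SAMPLES.items():
    replaced = with_replacement(raw)
    # The number of U+FFFD emitted per bad sequence is the "how many"
    # question; it depends on the decoder's substitution policy.
    print(name, is_valid_utf8(raw), len(replaced))
```

Python also offers a `surrogatepass` error handler that accepts the three-byte surrogate form and yields a lone U+D800, which shows that strictness here is a policy choice rather than a technical necessity.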

I'd like us to demand valid Modified UTF-8 at least
in Tcl 9.  At least I think so. Unless exploration of
the idea reveals a strong reason otherwise.

dgp added on 2020-04-30 15:27:56:
The trickiest issue here is that Tcl strings have two uses.

First, they are a way to store and process text, and for that purpose
increased Unicode conformance is a good way to bring the
capabilities of Tcl programming into agreement with the expectations
of text programming in other languages.

However, Tcl strings are also the common foundation of all values in
the language.  "Everything is a string".  If we remove some of the
alphabet we can use to make up those strings, we are removing some
of the expressible values in that representation set, which can lead
to trouble.

The same issue came up in the 8.0 -> 8.1 transition.  In 8.0 "Everything
is a string" meant "Everything is representable as a sequence of bytes",
and in order to support all values, Tcl 8.0 was written to support strings
covering that entire set of sequences of any byte value.

When 8.1 came out, the new support for international character sets
brought with it a new rule that the "string" values were conceptually
"Any sequence of UCS-2 characters", and when looking at
(char *) arguments something similar to UTF-8 would be used to
encode that concept.

This change meant that at least for some commands, script writers
had to become aware of the distinction between "bytes" and "chars".
Using the C routines properly demanded even more awareness.

If some user of Tcl 8.0 was accustomed to passing the bytes
"\xC2\x80" to Tcl and having them reliably treated and preserved
as a sequence of two things, Tcl 8.1 broke that contract, because
large parts of Tcl now saw and treated that as U+0080. To make
up for it, Tcl 8.1 provided bytearrays, so that anyone needing to
send arbitrary byte sequences into Tcl free from corruption had
a way to do it.
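The "two things" versus "one character" distinction can be sketched in Python. This is an analogy only: Python's `bytearray` stands in for a Tcl bytearray, and the variable names are chosen for illustration.

```python
raw = b"\xC2\x80"

# Byte-oriented view: a container of two independent byte values.
as_bytes = bytearray(raw)

# Text-oriented view: the same two bytes are the UTF-8 form of U+0080,
# so they collapse into a single character.
as_text = raw.decode("utf-8")

print(len(as_bytes), len(as_text))

# The byte container round-trips unchanged, which is the guarantee
# bytearrays restored for callers who need arbitrary byte sequences.
print(bytes(as_bytes) == raw)
```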

In parallel with that, I think if we revise our string definition to
exclude the UCS-2 characters that are surrogates in UTF-16,
we likewise need some new facility so that any Tcl users who
need to work with UCS-2 arrays can continue to do so.

I think there's at least some reason to believe UCS-2 arrays are
a concept Tcl programmers need.  If I imagine reading from
a channel encoded in utf-16, there can be a point where I've
read the high surrogate but not yet the low surrogate that follows
it.  In that moment, I need to store in the variable holding "what I've
read so far from the channel" a value that includes an unpaired
surrogate.  Just like I can currently use a bytearray to hold an
incomplete UTF-8 sequence partially read from a binary channel.
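A minimal Python sketch of that moment follows, assuming little-endian UTF-16 input that arrives in even-sized chunks. The helpers `code_units`, `split_complete`, and `to_text` are hypothetical names; the list of 16-bit integers plays the role of the "UCS-2 array".

```python
import struct

def code_units(chunk):
    """Split little-endian UTF-16 bytes into a list of 16-bit code units."""
    return list(struct.unpack("<%dH" % (len(chunk) // 2), chunk))

def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def split_complete(units):
    """Return (complete, pending); pending holds a trailing high surrogate."""
    if units and is_high_surrogate(units[-1]):
        return units[:-1], units[-1:]
    return units, []

def to_text(units):
    """Decode code units that are known to contain only complete pairs."""
    return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")

data = "a\U0001F600".encode("utf-16-le")   # 3 code units: 0061 D83D DE00
first, second = data[:4], data[4:]          # split inside the pair

complete, pending = split_complete(code_units(first))
# At this point 'pending' is an unpaired surrogate that must be stored.
text = to_text(complete) + to_text(pending + code_units(second))
print(text == "a\U0001F600")
```

The point of the sketch is that `pending` is a legitimate intermediate value, and some container must be able to hold it without rejecting or altering it.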

This probably means expanding the set of concepts for
Tcl programming to include bytes, code units, and characters.
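To make the three concepts concrete, here is a small Python sketch that counts each for one sample string. The string is arbitrary and the counts are for UTF-8 bytes, UTF-16 code units, and Unicode code points respectively.

```python
sample = "a\U0001F600"   # one BMP character plus one non-BMP character

n_bytes = len(sample.encode("utf-8"))               # UTF-8 bytes
n_code_units = len(sample.encode("utf-16-le")) // 2  # 16-bit code units
n_chars = len(sample)                                # code points

print(n_bytes, n_code_units, n_chars)
```

The three counts differ for this string, so any API that takes an index or a length has to say which of the three it means.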

I think the historical example of 8.0 -> 8.1 establishes that
such a change is not absolutely forbidden for 8.7, but it is
a change of remarkable scale and potential for disruption.
If we contemplate it, I think we are obligated to supply clear
migration support and high-quality advice on how to navigate it.

Does that make sense?

dgp added on 2020-04-28 21:58:01:
This points the way toward improved Unicode conformance, which is overall a good thing.

I have several thoughts which aren't terribly well organized yet. I'll try to get my 
thinking straight and clear before I spew too much here.  Most of the issues are
not about where to go, but how and when and how completely to get there.

jan.nijtmans added on 2020-04-28 21:49:55: (text/x-fossil-wiki)
One thing I'm interested to know: would this be OK for 8.7? I think it should be, even though it introduces a potential incompatibility: anyone using unpaired surrogates should deal with their own consequences. But if there is a lot of objection, it could be scheduled for 9.0 as well.