Tcl Source Code

Ticket UUID: d10d6ddf295864389294b966163a23a58b9a1e72
Title: UTF-16 Unicode escapes should continue to work in version 9.0
Type: Bug
Version: 9.0b2
Submitter: bhaible
Created on: 2024-07-01 16:30:16
Subsystem: 45. Parsing and Eval
Assigned To: jan.nijtmans
Priority: 5 Medium
Severity: Important
Status: Closed
Last Modified: 2024-08-22 11:32:48
Resolution: Wont Fix
Closed By: jan.nijtmans
Closed on: 2024-08-22 11:32:48
Description:
In Tcl 8.x, the only way to write Unicode characters in a way that is independent
of the encoding of the current locale is through the \uhhhh\uhhhh syntax, where the
first \uhhhh denotes a "high surrogate" and the second \uhhhh denotes a "low surrogate".

See:
$ tclsh8.6
% puts "\ud83d\ude03"
😃
% puts "\U0001F603"
�

Reference: https://www.tcl-lang.org/man/tcl8.6/TclCmd/Tcl.htm#M28

This does not work any more:

$ tclsh9.0
% puts "\ud83d\ude03"
error writing "stdout": invalid or incomplete multibyte or wide character
% puts "\U0001F603"
😃

Reference: https://www.tcl.tk/man/tcl9.0/TclCmd/Tcl.html#M28

Use of unencoded UTF-8 characters does not work when [encoding system] is different from utf-8.

But it is important to have a way to denote literal strings with Unicode characters that work
  - regardless of current locale, and
  - regardless of the Tcl version.
This is important e.g. in message catalogs (such as those generated by GNU gettext).
As it is now,
  - Unencoded UTF-8 characters don't work, because of e.g. ISO-8859-1 locales in Tcl 8.x,
  - \U000hhhhh doesn't work, because of Tcl 8.6,
  - \uhhhh\uhhhh doesn't work, because of Tcl 9.0beta2.

I'm not saying that all of https://core.tcl-lang.org/tips/doc/trunk/tip/497.md is wrong.
I'm only saying that the parser should recognize the special case of \uhhhh\uhhhh with
a high surrogate followed by a low surrogate and convert this to a 32-bit Unicode
code point.
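
For illustration, the conversion requested here amounts to the standard UTF-16 surrogate arithmetic. The following is a minimal sketch in Tcl that operates on integers only; the proc name `surrogatePairToCodePoint` is hypothetical and is not part of Tcl or of any proposed patch.

```tcl
# Combine a UTF-16 surrogate pair, given as two integers, into one
# Unicode code point. Hypothetical helper, for illustration only.
proc surrogatePairToCodePoint {hi lo} {
    if {$hi < 0xD800 || $hi > 0xDBFF} {
        error "not a high surrogate: [format 0x%04X $hi]"
    }
    if {$lo < 0xDC00 || $lo > 0xDFFF} {
        error "not a low surrogate: [format 0x%04X $lo]"
    }
    # 10 payload bits from each half, offset by 0x10000.
    return [expr {0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00)}]
}

# The pair from the example above:
puts [format U+%04X [surrogatePairToCodePoint 0xD83D 0xDE03]]
# prints U+1F603
```

The sketch deliberately rejects anything that is not a high surrogate followed by a low surrogate, which is the only case the request covers.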
User Comments:
jan.nijtmans added on 2024-08-22 11:32:48:

> @bhaible, one more time, parsing of \u syntax has not changed between Tcl 8.6 (any patchlevel) and Tcl 9

What changed in Tcl 9 is the "UtfToUtf" encoding used when inputting or outputting utf-8. In Tcl 8.6, this function is assumed to convert invalid utf-8 to valid utf-8 whenever possible, which also means that surrogate pairs are combined when written to stdout. Tcl 9 doesn't do that any more, so the surrogate pairs are kept as-is and (due to the 'strict' profile) result in an exception.

So, indeed, the parsing didn't change between 8.6 and 9.0. It's the output conversion which joins the surrogates, and - therefore - appears to do the right thing in 8.6.

My advice: Handle it like this


apnadkarni added on 2024-08-13 13:38:26:

> It is about modifying the parser of string literals (and only string literals), to recognize a syntax that was understood by Tcl 8.6.

@bhaible, one more time, parsing of \u syntax has not changed between Tcl 8.6 (any patchlevel) and Tcl 9. Yet another example,

% package require Tcl
8.6.14
% string equal [string index "\ud83d\ude03" 0] "\ud83d"
1

If 8.6 was combining surrogates during parsing, the above would have returned 0. Nevertheless, since you are insisting that is the issue, I traced through the code for 8.6 and 9. In both releases, the sequence is recognized in identical fashion, and results in generation of a string with two surrogate code points. You can compare the code in tclParse.c if you are still not satisfied.

To further illustrate what has been said earlier about the issue you are facing being related to the output, I've modified the tcl8 profile in branch apn-profile-tcl8-surrogates to do the combining of surrogates on output. Here's the result:

% fconfigure stdout -profile tcl8 -encoding utf-8
%  puts "\ud83d\ude03"
😃

Note the use of the profile to change from the strict profile. Making this change to the tcl8 profile could be classified as a bug or an RFE. The latter would require TCT approval, so there are no guarantees it would be done (and the current implementation is incomplete with respect to fragmented sequences).

I also do not know if that would meet your needs but at least I hope you are convinced the Tcl syntax parser does not have to be changed.


juliannoble2 added on 2024-08-13 08:43:33:
> This ticket is not about using cesu-8 anywhere (cesu-8 being a recipe for bugs and security vulnerabilities, see https://en.wikipedia.org/wiki/CESU-8).
> It is about modifying the parser of string literals (and only string literals), to recognize a syntax that was understood by Tcl 8.6.

Ok - I disagree that cesu-8 isn't perfectly valid to use *internally*, and nothing about my post was about reading or writing cesu-8 encodings externally - but I admit to not understanding your use case or the details well enough - so I'll bow out, sorry.

bhaible added on 2024-08-13 07:51:49:
> Isn't this a usecase that fixing the cesu-8 encoding would fix?

This ticket is not about using cesu-8 anywhere (cesu-8 being a recipe for bugs and security vulnerabilities, see https://en.wikipedia.org/wiki/CESU-8).

It is about modifying the parser of string literals (and only string literals), to recognize a syntax that was understood by Tcl 8.6.

juliannoble2 added on 2024-08-13 07:01:55:
Isn't this a usecase that fixing the cesu-8 encoding would fix?

https://core.tcl-lang.org/tcl/tktview/2f22a7364d

I have a test function that decodes \uFF30\ud83d\ude03\U1f400\UFF31 into <wide-P><smiley><mouse><wide-Q> but currently relies on some cesu code provided by dasBrain on the chat - and still has bugs due to the cesu-8 codepoint collisions

apnadkarni added on 2024-08-12 17:09:10:

@bhaible, since you mentioned 29 years of Tcl :-) compare Tcl 8.6.4 with Tcl 9 when it comes to scanning the \uxxxx sequences.

% package req Tcl
8.6.4
% scan \ud83d\ude03 %c%c
55357 56835

and

% package req Tcl
9.0b3
% scan \ud83d\ude03 %c%c
55357 56835

The parsing/scanning is identical as you see.

On the other hand, newer Tcl 8.6.x versions are different (x >= 10?)

% package req Tcl
8.6.14
% scan \ud83d\ude03 %c%c
128515 {}

Note how 8.6.14 differs from 8.6.4. This is what I was referring to earlier.

Now with reference to your puts example: as you can see, the scanned characters are NOT combined in either 8.6.4 or 9.0, so the presumption that it is a parsing/scanning problem in 9.0 does not seem likely, given your statement that 8.6 never had this issue. Rather, as Jan hints at in his response, the difference seems to be that Tcl 9 adheres more strictly to the Unicode standard about where surrogates may appear in I/O. In particular, surrogates should never appear anywhere except in a UTF-16 stream, not even in UTF-8 or UTF-32 streams. Hence the error on the puts command. There was plenty of debate on this and on the migration pain it might cause, but closer adherence to the Unicode standard on exchange of data was deemed important.
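
One way to check what a given interpreter actually stores for a literal is to list the code point of each string element, much as the scan transcripts above do. A minimal sketch, assuming a hypothetical helper name `codePoints`; what it reports for surrogate escapes depends on the Tcl version, as those transcripts show.

```tcl
# Return the code point of every element of a string as a list of
# U+XXXX labels. Hypothetical diagnostic helper, for illustration only.
proc codePoints {str} {
    set result {}
    set n [string length $str]
    for {set i 0} {$i < $n} {incr i} {
        # scan %c yields the integer value of a single string element
        scan [string index $str $i] %c cp
        lappend result [format U+%04X $cp]
    }
    return $result
}

# A BMP-only string gives the same answer in every version:
puts [codePoints "A\u00e9"]
# prints U+0041 U+00E9
```

Running `codePoints "\ud83d\ude03"` in the interpreter under test then shows directly whether two surrogates or one combined code point ended up in the string.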

With respect to the original problem you are having, I understand where you are coming from. Opinion: the fact is that Tcl 8.6, no matter what version, never properly supported characters outside the BMP, regardless of things that "happened to work". That includes 8.6.11 and later, which combine surrogates in some instances and not others. Applications that expect to run on 8.6 should not be using those characters in message files (or any source files, for that matter). While it may have been a workaround in Tcl 8 in cases where non-BMP characters were mandated, there is no reason to continue doing so now that Tcl 9 with full Unicode is here. But as I said, that's my opinion.


bhaible added on 2024-08-12 16:55:39:
> There's a portable way to specify an emoji:
> % encoding convertfrom utf-8 \xF0\x9F\x98\x81
> 😁

So, what you suggest is to replace string literals with expressions. So, to write, e.g. in a message catalog file,

::msgcat::mcset fr "smiley" [expr { $tcl_version < 9 ? "\ud83d\ude03" : "\U0001F603" }]

Is that what you meant?

In my use-case it would work. I don't know whether it would work anywhere where string literals can be used.

But in any case, the migration cost (from version 8.6 to 9.0) is non-negligible for Tcl users.

jan.nijtmans added on 2024-08-12 15:23:44:

There's a portable way to specify an emoji:

% encoding convertfrom utf-8 \xF0\x9F\x98\x81
😁

This will work with Tcl 8.7, 9.0, and newer Tcl 8.6 versions.
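
For code points other than this one, the four \x bytes can be computed from the code point with the standard UTF-8 layout for the range U+10000 to U+10FFFF. A minimal sketch, assuming hypothetical helper names `nonBmpUtf8Bytes` and `nonBmpChar`; whether the decoded result is a single character depends on the Tcl version, as noted above.

```tcl
# Build the 4-byte UTF-8 sequence for a code point outside the BMP.
# Hypothetical helper, for illustration only.
proc nonBmpUtf8Bytes {cp} {
    if {$cp < 0x10000 || $cp > 0x10FFFF} {
        error "code point out of range: [format 0x%X $cp]"
    }
    return [binary format c4 [list \
        [expr {0xF0 | ($cp >> 18)}] \
        [expr {0x80 | (($cp >> 12) & 0x3F)}] \
        [expr {0x80 | (($cp >> 6) & 0x3F)}] \
        [expr {0x80 | ($cp & 0x3F)}]]]
}

# Decode those bytes the same way as the example above does.
proc nonBmpChar {cp} {
    return [encoding convertfrom utf-8 [nonBmpUtf8Bytes $cp]]
}

puts [binary encode hex [nonBmpUtf8Bytes 0x1F601]]
# prints f09f9881
```

The hex output matches the \xF0\x9F\x98\x81 sequence used in the example above.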

I'm quite reluctant to implement your (@bhaible) suggestion. It would mean that the following won't give an exception any more in Tcl 9.0:

% puts stdout \uD83D\uDE01
error writing "stdout": invalid or incomplete multibyte or wide character

This behavioral change in Tcl 9.0 was proposed in TIP #619, and accepted. This means, a new TIP will need to be written to undo that change. I'm afraid there won't be enough support for that.

Hope this helps, Jan Nijtmans


bhaible added on 2024-08-12 11:02:53:
> an result of 8.x not fully supporting non-BMP characters

Nope, Tcl 8.6 was supporting non-BMP characters just like Java has for 29 years. Namely, character literals are supported via \uxxxx\uyyyy escape sequences, and when output to a UTF-8 stream, the non-BMP characters (2 storage units) are converted to UTF-8. And yes, for programs that look at the storage units inside strings it makes a difference whether a character is in the BMP (1 unit) or outside the BMP (2 units).

> I don't think it advisable to break this consistency.

You are arguing about the internal representation of strings.

I am arguing about the external representation, how string literals are represented in files.

There does NOT need to be a 1:1 correspondence.

Also, I am arguing about a migration issue. You are saying "applications must move to 9.0" but are not giving enough thought to how to make migration as easy as possible for your users.

> use UTF-8 and specify the `-encoding utf-8` option instead of being dependent on locales.

My software (GNU gettext) writes the message catalogs, that contain string literals, but does not control or influence how the applications are run. Therefore your advice is not applicable.

apnadkarni added on 2024-08-12 10:22:59:

Right or wrong, each element of a Tcl string in Tcl 9 is a Unicode code point, not a Unicode character. Correspondingly, the Unicode escape sequences map to individual code points. I don't think it advisable to break this consistency.

The 8.x-9.0 issue you are facing is really a result of 8.x not fully supporting non-BMP characters, thereby requiring hacks. In fact, I don't think (not confirmed) the snippets you gave even work in 8.6 versions prior to 8.6.10 or thereabouts. Consider the following fundamental inconsistency in 8.6 non-BMP support recently raised on the chat:

% string length "\ud83d\ude03"
2
% llength [split "\ud83d\ude03" ""]
1

My take is that applications that need non-BMP support must move to 9.0. Otherwise, niggling issues like the above inconsistency will keep arising. If that is not possible at all, use UTF-8 and specify the -encoding utf-8 option instead of being dependent on locales.
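
As a rough illustration of the last point, the channel encoding can be set explicitly so that file contents do not depend on [encoding system]. A minimal sketch, assuming hypothetical helper names `writeUtf8` and `readUtf8`; for script files, `source -encoding utf-8` plays the same role.

```tcl
# Write text to a file as UTF-8, regardless of the system encoding.
# Hypothetical helper, for illustration only.
proc writeUtf8 {path text} {
    set ch [open $path w]
    fconfigure $ch -encoding utf-8
    puts -nonewline $ch $text
    close $ch
}

# Read a UTF-8 file back, regardless of the system encoding.
proc readUtf8 {path} {
    set ch [open $path r]
    fconfigure $ch -encoding utf-8
    set text [read $ch]
    close $ch
    return $text
}
```

Usage would be along the lines of `writeUtf8 $path "caf\u00e9"` followed by `readUtf8 $path`, which returns the same string whatever locale the application is started in.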