Tcl Source Code

Ticket UUID: 0d61d3a2bb905178694d6cde8147039974600274
Title: Tcl interprets two adjacent surrogate code points as a character encoded using UTF-16
Type: Bug
Version: 8.7
Submitter: pooryorick
Created on: 2021-03-06 18:26:33
Subsystem: 18. Commands M-Z
Assigned To: jan.nijtmans
Priority: 5 Medium
Severity: Critical
Status: Closed
Last Modified: 2022-07-13 14:49:12
Resolution: Fixed
Closed By: jan.nijtmans
Closed on: 2022-07-13 14:49:12
Description: (text/x-fossil-wiki)
This report was inspired by [debd088e48998a75].

Various Unicode code points do not correspond to any abstract character.  Such
code points include private-use code points, surrogate code points,
noncharacters, specials, tag characters, and unassigned code points.  Although
none of these code points corresponds to any abstract character, the Unicode
standard takes care not to forbid these code points in a Unicode string.
Rather than saying that such characters and non-characters cannot occur in a
text, it simply labels some of them as "restricted interchange".  Elsewhere,
the standard explicitly permits their use for internal purposes.

A surrogate pair is a sequence of two code points used in utf-16 to represent a
character beyond those that could be represented in ucs-2.  When a surrogate
pair is encountered during the decoding of text encoded in utf-16, it is
translated into another code point.
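
As an illustration of that translation, the arithmetic a utf-16 decoder performs can be sketched on plain integers, which sidesteps the question of how a given Tcl version treats surrogates inside a string. The proc name <code>surrogatePairToCodePoint</code> is hypothetical and not part of Tcl.

```tcl
# Hypothetical helper (not a Tcl built-in): combine a high and a low
# surrogate code unit into the code point a utf-16 decoder would produce.
proc surrogatePairToCodePoint {high low} {
    if {$high < 0xD800 || $high > 0xDBFF} {
        error "not a high surrogate: [format U+%04X $high]"
    }
    if {$low < 0xDC00 || $low > 0xDFFF} {
        error "not a low surrogate: [format U+%04X $low]"
    }
    # Each surrogate contributes 10 bits; the result is offset by 0x10000.
    expr {0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00)}
}

puts [format U+%X [surrogatePairToCodePoint 0xD83E 0xDD1D]]
# prints U+1F91D
```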

These surrogate code points were added to Unicode to ensure no overlap between
their use in utf-16 and other **external** uses.

According to the Unicode standard,

    "Surrogate code points cannot be conformantly interchanged using Unicode encoding forms. They do not correspond to Unicode scalar values and thus do not have well-formed representations in any Unicode encoding form."

The Unicode standard also states that

    "A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character."

What it does not say is that a surrogate code point cannot be used internally
for some arbitrary purpose.  We also know that the "abstract character"
prohibition doesn't imply a prohibition on internal use because the
Unicode standard explicitly says this about noncharacter code points:

    "The noncharacter code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly."


What can be derived from all this is that if a code point does not map to an
abstract character, in the absence of any other interpretation it should be
treated merely as a code point, but not disallowed.

Tcl correctly handles the case of a single surrogate code point by simply
including the code point in the string as requested by the script author:

<code><verbatim>
string length xxx\ud83eyyy
#-> 7
</verbatim></code>


But when two surrogate code points occur successively Tcl does something
bizarre:  It translates the two code points into another code point as if it
were a utf-16 decoder.

<code><verbatim>
expr {
    [encoding convertto utf-8 \ud83e\udd1d]
    eq
    [encoding convertto utf-8 \U1f91d]
}
# -> 1
</verbatim></code>

This is a big mistake.

Tcl is not a utf-16 decoder.  For transcoding utf-16 Tcl provides:

<code><verbatim>
encoding convertfrom/convertto utf-16
</verbatim></code>

Tcl is also not a Unicode application.  Rather, it is a tool for creating
Unicode applications.  In fact, it is even further removed than that:  It is a
tool for creating tools that create Unicode applications.  As such, Tcl must be
**Unicode-agnostic**, not **Unicode-conformant**.  As an all-purpose tool for
creating Unicode-conformant applications, Tcl must provide the capability to
create arbitrary sequences of Unicode code points so that application authors
can create systems that meet their needs.  Restricting scripts to working only
with Unicode characters rather than Unicode code points deprives those scripts
of significant utility.

What this means, for example, is that rather than performing normalization, Tcl
should provide routines that perform whatever normalization a
particular application requires.  In the case at hand, though, what it means
is that Tcl **should not** try to conform to the Unicode standard by becoming a
quasi utf-16 transcoder.  The application (script) author should be free to
craft whatever Unicode strings they see fit to craft for internal use.  If the
author wants the HANDSHAKE character, there is an easy way to do that:

<code><verbatim>
set string \U1f91d
</verbatim></code>

Likewise, if the author wishes to create a sequence of two code points which
both happen to be surrogates, there should be an easy way to do that:

<code><verbatim>
string length \ud83e\udd1d
#-> 2
</verbatim></code>

These two code points should subsequently continue to be treated as two code
points.  This facilitates the creation of any possible sequence of code points.

The script level is not the place to expose details of utf-16 encoding.
Rather, it should remain a pure and flexible Unicode environment where it is
possible to craft any sequence of code points.  It is the responsibility of the
application author to ultimately produce conformant text.  If a script attempts
to convert to utf-16 a sequence of code points that can't be represented in
utf-16, Tcl should then produce an error, as this is the boundary between
internal and external use.
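
A rough sketch of what "error at the boundary" could look like, written at the script level over a list of integer code points. It does not use or describe Tcl's own <code>encoding</code> machinery; the proc name <code>codePointsToUtf16Units</code> and its error messages are assumptions made for illustration.

```tcl
# Hypothetical helper: turn a list of integer code points into utf-16 code
# units, raising an error at the encoding boundary instead of guessing.
proc codePointsToUtf16Units {codePoints} {
    set units {}
    foreach cp $codePoints {
        if {$cp < 0 || $cp > 0x10FFFF} {
            error "not a Unicode code point: $cp"
        }
        if {$cp >= 0xD800 && $cp <= 0xDFFF} {
            error "surrogate [format U+%04X $cp] cannot be encoded in utf-16"
        }
        if {$cp < 0x10000} {
            # BMP code point: one code unit.
            lappend units [expr {$cp + 0}]
        } else {
            # Supplementary code point: split into a surrogate pair.
            set v [expr {$cp - 0x10000}]
            lappend units [expr {0xD800 + ($v >> 10)}] [expr {0xDC00 + ($v & 0x3FF)}]
        }
    }
    return $units
}

puts [lmap u [codePointsToUtf16Units {0x78 0x1F91D}] {format %04X $u}]
# prints 0078 D83E DD1D

puts [catch {codePointsToUtf16Units {0x78 0xD83E}} msg]
# prints 1 (the lone surrogate is rejected; $msg holds the reason)
```

Internally the list <code>{0x78 0xD83E}</code> remains a perfectly usable value; only the conversion step refuses it.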

Recent tickets hint at the troubles this new behaviour will cause.  One example
is [22324bcbdf].  The incorrect assumption in this ticket is that
<code>\ud83d\udca3</code>  should be interpreted as the BOMB character.  It
should not.  Rather, it should be interpreted as two Unicode code points that
each correspond to no abstract character.  String reversal of these two code
points then becomes trivial:  The result of reversing <code>\ud83d\udca3</code>
would be <code>\udca3\ud83d</code>.

In short, the only time Tcl should do anything special with surrogate code
points is when it is converting to/from utf-16.  Treating the <code>\u</code>
notation of two consecutive surrogate code points as the character they encode
in utf-16 makes as much sense as treating the occurrence of
<code>\xc0\x80</code> in a script as a <code>NULL</code> character, and would
have similar consequences.  This mode of operation cripples Tcl as a usable
Unicode environment.  Hopefully it will be backed out before it's too late.
User Comments:

jan.nijtmans added on 2022-07-13 14:49:12: (text/x-fossil-wiki)
Since [https://core.tcl-lang.org/tips/doc/trunk/tip/619.md|TIP #619] is now Accepted and merged into 9.0, we can close this ticket.

Fixed in Tcl 9.0

pooryorick added on 2022-04-19 21:16:16: (text/x-fossil-wiki)
Tip [https://core.tcl-lang.org/tips/doc/trunk/tip/619.md|619] looks good!  Many thanks for your continued work on this.  Clearly, getting Tcl ready to move beyond the BMP has been a monumental effort.  The future looks good!

jan.nijtmans added on 2022-04-18 17:26:50: (text/x-fossil-wiki)
A proposed fix for this is put together now in [https://core.tcl-lang.org/tips/doc/trunk/tip/619.md|TIP #619]

jan.nijtmans added on 2021-10-31 11:13:15: (text/x-fossil-wiki)
> Is there a plan to fix this?

In Tcl 8.7, due to the dual-value implementation of strings (UTF-8 and UTF-16), this cannot be fixed: it cannot be prevented that the internal UTF-16 implementation leaks through at script level. In Tcl 9.0, there's still room for changing this, since the internal implementation changed from UTF-16 to UTF-32. So, feel free to start a new branch based on 9.0 (and - maybe - a new TIP): this way it will become more clear what the implications of those changes are.

pooryorick added on 2021-10-28 11:44:52: (text/x-fossil-wiki)
Tip 573 was withdrawn but the following script currently still evaluates to <code>1</code>
in core_8_branch and in trunk. 

<code><verbatim>
expr {
    [encoding convertto utf-8 \ud83e\udd1d]
    eq
    [encoding convertto utf-8 \U1f91d]
}
</verbatim></code>

Is there a plan to fix this?

jan.nijtmans added on 2021-03-12 15:52:58: (text/x-fossil-wiki)
Since this ticket is already being hijacked anyway 😉 ...

First version of [https://core.tcl-lang.org/tips/doc/trunk/tip/600.md|Migration guide] is available now.

Don, I would like to invite you to be co-author. We can add more here as the Tcl 9 design becomes more clear.

pooryorick added on 2021-03-11 08:32:10:
I sincerely hope that rather than hacking makeshift solutions together and
propagating them forward in the name of backwards compatibility, Tcl takes the
time to get the design right and implement a high-quality Unicode environment.
I'm eager to help with the gruntwork wherever time affords.

chw added on 2021-03-11 00:00:24:
Or to be explicit: in order to avoid ambiguities there must be a
Tcl_UniChar data type able to express at least 24 bits. Otherwise
we are doomed to define TIPs over functions over procedures over
conventions over API variations over defines and so on, and never
find a proper single solution.

chw added on 2021-03-10 22:52:22:
Indeed, we are now back to square one: as long as Tcl_UniChar is
a 16-bit entity, we will suffer from the complexities laid out
some time ago in this little unspectacular piece:
https://wiki.tcl-lang.org/page/Why+AndroWish+switched+to+TCL%5FUTF%5FMAX%3D6

As long as 16 bits are set, no way out of that blues.

pooryorick added on 2021-03-10 21:57:58:
I just read the proposal again, and realized it contains this:

One exception to this has to be made. When using the escape sequence \uD800\uDC00, so a valid combination of an upper and a lower surrogate, in a script, there is no harm in translating that to the intended Unicode code point. In Tcl 8.6 and 8.7 there is no other way than that for specifying Emoji. Allowing this provides an upgrade path for existing scripts handling Emoji. Starting with Tcl 8.7, the "\UXXXXXX" escape sequence should be used for this.

Such an exception ruins everything and brings us back to square one, where it isn't possible to compose a string of an arbitrary sequence of code points.  This is too high a price to pay for the ability to enter smiley faces.

pooryorick added on 2021-03-10 21:40:48: (text/x-fossil-wiki)
TIP 597 is more like it!  Considering a Unicode string to be a sequence of code
points rather than a sequence of characters, and allowing a string to be
composed of any sequence of code points at all is the perfect solution to this
issue.  Tcl can leave it to the various transcoders to decide which code point
sequences are allowed in the target encoding.  But since any sequence of
Unicode code points is in fact Unicode, <code>string is unicode</code> is a
misnomer.  <code>string is character</code> would be a more accurate name.
Perhaps there could also be a <code>string is assigned</code> and even a
<code>string is surrogate</code>.

In my opinion, rather than silently replacing code points it can't work
with, an encoder should produce an error if it is handed a string it can't
properly encode.  The current practice of replacing invalid characters has
caused all sorts of trouble, and the sooner it is eliminated, the better.
Replacement can be implemented at the script level if needed, but if it's
baked in, there's no escaping it.

Thanks, Jan, for revisiting the design and finding a better approach, and
thanks, Don, for taking time to articulate the alternatives.

dgp added on 2021-03-10 18:01:11:
The Unicode Glossary

   https://unicode.org/glossary/

defines a Unicode Scalar Value.

It might do us good to use that term when that's what we mean for greater
clarity and conformity to language set by others.

"[string is unicode $s] returns 1 when all codepoints in $s are Unicode Scalar Values, 0 otherwise."

as a possible example.
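
To make the glossary definition concrete, here is a small sketch of the scalar-value test expressed over integer code points. It is not the implementation of any <code>string is</code> class; the proc names <code>isScalarValue</code> and <code>allScalarValues</code> are hypothetical.

```tcl
# A Unicode scalar value is any code point in 0..0x10FFFF except the
# surrogate range 0xD800..0xDFFF.  (Hypothetical helper names.)
proc isScalarValue {cp} {
    expr {$cp >= 0 && $cp <= 0x10FFFF && ($cp < 0xD800 || $cp > 0xDFFF)}
}

# Return 1 when every code point in the list is a scalar value, else 0.
proc allScalarValues {codePoints} {
    foreach cp $codePoints {
        if {![isScalarValue $cp]} {
            return 0
        }
    }
    return 1
}

puts [allScalarValues {0x78 0x1F91D}]
# prints 1
puts [allScalarValues {0x78 0xD83E}]
# prints 0
```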

jan.nijtmans added on 2021-03-10 13:28:32: (text/x-fossil-wiki)
Thanks, Don, for this clear explanation.

It inspired me to write [https://core.tcl-lang.org/tips/doc/trunk/tip/597.md|TIP #597], as a possible way out: No need to reduce the Tcl value set, but still the guarantee that produced "utf-8" conforms to the Unicode standard.

I'm closing this ticket now.

dgp added on 2021-03-09 17:00:05:
This is a dispute over the alphabet of Tcl's value set.  Jan is correct
to point to TIP 573 as the key issue.

All Tcl values have always been (conceptually) "strings", but the precise definition of what is a "string" has varied with time/release.  The last
large revolution in Tcl's string value definition was the 8.0 -> 8.1
transition.  In 8.0 a "string" was a sequence of zero or more bytes.
In 8.1, a "string" became a sequence of zero or more UCS-2 values, which
our documentation and other writings call "characters".

The migration plan for Tcl 9 that Jan has developed appears to aim for
a Tcl 9 definition of "string" as "a proper Unicode string".  This implies
the exclusion of values that are not valid Unicode.  The Unicode alphabet
does not include the surrogate codepoints.  They cannot be included in any
valid Unicode string, so on this plan they must no longer be in the Tcl
value alphabet.  Consequently all the existing tools that could create or
manipulate these values in earlier releases have to incompatibly change,
and there's a disruption for developers that has to be managed.

One factor in favor of a goal to have the Tcl 9 value
set be proper Unicode strings is that it moves Tcl more in the direction of
using standard components and tools and names and away from its frequent (bad?)
habit of inventing its own weird alternatives.  That has the potential to
make Tcl's code base marginally more attractive to experienced coders.

I think there are two categories of reasonable criticism.  First, for the first
time the Tcl value set is getting reduced.  There are values that were once
valid in Tcl that no longer will be.  This reduces the set of programs that
can be written, or easily written.  Manipulation of arbitrary UCS-2
sequences in a Tcl script becomes something unnatural in a language that 
refuses to directly represent those values.  This means writing a Tcl script
that does UTF-16 encoding/decoding is something that has no implementation
that rests easily on the capabilities of the language.  It can still be
accomplished with suitable use of [binary format|scan t*] (and the
inefficiencies inherent in that), but that's not the most satisfying answer
to many programmers.  It's a task where the language has to be worked around rather than used. From this perspective, the goal itself is flawed, and
we should change course and aim to have the Tcl 9 definition of Tcl strings
somehow continue to be a superset of the set of all UCS-2 sequences, and we
should retain all the tools we have to operate on such values.
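
A minimal sketch of the workaround mentioned above: arbitrary 16-bit values are kept as a list of integers, packed with <code>binary format</code> and recovered with <code>binary scan</code>. The big-endian <code>S</code> specifier is used here only so the byte order is reproducible; <code>t</code> is its native-order counterpart. The sample values are illustrative.

```tcl
# An arbitrary sequence of 16-bit values, including an unpaired surrogate,
# held as a list of integers rather than as characters in a Tcl string.
set units {0x0078 0xD83E 0x0079}

# Pack the values as big-endian 16-bit integers ...
set bytes [binary format S* $units]
puts [binary encode hex $bytes]
# prints 0078d83e0079

# ... and unpack them again as unsigned 16-bit integers.
binary scan $bytes Su* decoded
puts $decoded
# prints 120 55358 121
```

This works, but every operation on the sequence (indexing, reversal, comparison) has to be done on the integer list rather than with the string commands, which is the inconvenience described above.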

The second category of reasonable criticism is that the Tcl 9 goal to
define proper Unicode strings as Tcl's value set is acceptable, but that
the current migration path to get there is bumpier than it ought to be.
In that realm, the more specific the complaint the better, and offering
alternatives is the best way forward.  Also, in this realm, the most useful
venue for hashing out the issues would be the production of migration guides
for Tcl programmers.  Forcing ourselves into the teaching role should force
us to uncover just where the rough edges are, and whether they can be sanded
down, or whether any are deal-breakers.

I tend to agree with the first criticism.  From pooryorick's submission
I think he does too.  Jan has done the work, though.  He's held the TIP
votes and got the requisite approvals.  Tcl 8.7 behaves the way it does
(mostly, modulo some errors along the way) because the TCT accepted the
changes.  Perhaps some of the TCT supporters of this path have not been
fully aware of these issues of disruption and
representation, and it might be worthwhile to raise discussions on TCLCORE,
but I think that's the better venue for confronting the dispute at this point
than a ticket asserting a bug.  For myself, I've taken the attitude for some
time now that no matter my reservations and distaste, if I cannot deliver a
better alternative, there's little value in being no more than a naysayer.

Pooryorick, if you want to pursue another approach, I'm willing to help
where and as I can.

I hope that's fair to everyone.

jan.nijtmans added on 2021-03-08 09:09:11: (text/x-fossil-wiki)
Can you please change the example:

<code><verbatim>
set string \u1f91d
</verbatim></code>
into
<code><verbatim>
set string \U1f91d
</verbatim></code>

From this I'm assuming this bug report references 8.7, since 8.6 doesn't accept the \U?????-form yet for code points > U+FFFF.

Tcl indeed interprets such escape sequences as you describe. It's a feature, not a bug. Internally, Tcl stores strings in two possible forms: one is [https://simonsapin.github.io/wtf-8/|WTF-8], the other is UTF-16.
Unicode allows use of such forms internally, as long as when communicating with the outside world it's translated into correct UTF-8. Since 8.6.10, Tcl does this correctly when it can (when it consists of valid surrogate-pairs) but allows WTF-8 forms to escape when they contain unpaired surrogates.

I would like to deprecate such usage, see [https://core.tcl-lang.org/tips/doc/trunk/tip/573.md|TIP #573], but I don't think this would be accepted for 8.7. Maybe for 9.0, but - still - this TIP text is premature, fully open for discussion.

For 8.x, yes it's too late, since \uD83E\uDD1D is already a valid (internal) representation for 🤝. It's the only portable way to do that in Tcl 8.6, in Tcl you can simply use the 🤝 character directly [https://core.tcl-lang.org/tips/doc/trunk/tip/587.md|TIP #587] or you can use the \U1F91D escape sequence.

Can I close this ticket as "Fixed in 9.0"? If you find situations in which WTF-8 forms escape (like the two recent bugs Christian Werner reported), please file a bug report! I'm not claiming Tcl 8.6/8.7/9.0 is there yet ([https://core.tcl-lang.org/tips/doc/trunk/tip/587.md|TIP #575] is being voted on now), but it's getting better and better.