Tcl Source Code

Ticket UUID: 2564363
Title: string equality testing does not cover encoding variants
Type: Bug
Version: None
Submitter: dgp
Created on: 2009-02-04 15:40:07
Subsystem: 10. Objects
Assigned To: dgp
Priority: 7 High
Severity: Minor
Status: Open
Last Modified: 2017-06-05 17:07:18
Resolution: None
Closed By: nobody
Closed on:
Description:
% set a [encoding convertfrom identity \x21]
!
% set b [encoding convertfrom identity \xc0\xA1]
!
% expr {$a eq $b}
0
% string equal $a $b
0
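
A minimal C sketch of why the two values print the same yet compare unequal. It assumes a lenient decoder that does not reject overlong two-byte forms, which is what the session above suggests. The names `lenient_decode`, `bytes_equal` and `codepoints_equal` are hypothetical illustrations, not Tcl APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a Tcl API: decode one code point from s without
 * rejecting overlong two-byte forms. Returns the number of bytes consumed
 * (1 or 2) and stores the code point in *cp. */
static size_t lenient_decode(const unsigned char *s, size_t len, unsigned *cp)
{
    assert(s != NULL && cp != NULL && len > 0);
    if (len >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        /* Two-byte form 110xxxxx 10yyyyyy, accepted even when overlong. */
        *cp = ((unsigned)(s[0] & 0x1F) << 6) | (unsigned)(s[1] & 0x3F);
        return 2;
    }
    *cp = s[0];  /* everything else is passed through as a single byte */
    return 1;
}

/* Equality on the raw bytes of the two string reps. */
static int bytes_equal(const unsigned char *a, size_t alen,
                       const unsigned char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}

/* Equality on the code points obtained by lenient decoding. */
static int codepoints_equal(const unsigned char *a, size_t alen,
                            const unsigned char *b, size_t blen)
{
    while (alen > 0 && blen > 0) {
        unsigned ca, cb;
        size_t na = lenient_decode(a, alen, &ca);
        size_t nb = lenient_decode(b, blen, &cb);
        if (ca != cb) {
            return 0;
        }
        a += na; alen -= na;
        b += nb; blen -= nb;
    }
    return alen == 0 && blen == 0;
}
```

With the bytes `0x21` for `$a` and `0xC0 0xA1` for `$b`, `bytes_equal` reports a difference while `codepoints_equal` reports equality, because both inputs decode to U+0021.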
User Comments: dgp added on 2017-06-05 17:07:18:
This may be fixed now by the fix for

https://core.tcl.tk/tcl/tktview/67aa9a207037ae67f9014b544c3db34fa732f2dc

At least the expression of the problem is now changed.

msofer added on 2009-12-11 09:50:56:
just passing it around ....

dkf added on 2009-05-01 15:26:05:
I agree about eliminating the identity “encoding”; it's nothing but trouble.

jenglish added on 2009-02-05 07:50:31:
> The core problem is that Tcl has not
> made up its mind whether such variants
> are to be accepted or rejected.

My vote: rejected. More specifically: an invalid UTF-8 octet sequence as a Tcl_Obj* string value leads to undefined behavior. (Not an error, _undefined behavior_. Tcl should not be required to detect such conditions, either.)

And [encoding convertfrom identity] has to go.

ferrieux added on 2009-02-05 07:31:51:
It seems we are clearly violating advice given in (uncompiled) StringEqualCmd:

   * Remember to keep code here in some sync with the byte-compiled versions
   * in tclExecute.c (INST_STR_EQ, INST_STR_NEQ and INST_STR_CMP as well as

Indeed, the code near that comment tries two shimmer-less comparisons (ByteArray and String) before resorting to GetStringFromObj(), while the equivalent code in TEBC (TclExecuteByteCode) goes straight to it.
A first idea would be to respect the above advice and stick to perfect eval-compile symmetry.
However, shimmering is just a matter of speed and shouldn't affect the external EIAS ("everything is a string") semantics.
So it might be better to remove that comment, since the lack of symmetry helps highlight a nasty bug.

Now to the bug itself: here we have two pure string-rep values (results from [encoding convertfrom identity]), which are two different UTF-8 byte sequences. By calling [string length] on them we compute their String (unicode) intreps, which happen to be the same. Then, when we call the non-compiled variant of [string equal], we hit the shimmer-less case and compare the (equal) unicode strings instead of the (different) byte sequences.

So it seems we have a situation similar to non-canonical lists that would be compared on their List intreps.
The solution would be to similarly add an "isCanonical" flag to the String intrep, and to take that flag into account in the "fast-track" comparisons (i.e. forbid such comparisons on all but canonical Strings).

More generally this would be the case of any intrep that is "fast-tracked" in one of the equality tests and fails to record deviation from canonicity.
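
The "isCanonical" idea can be sketched in C as follows. This is only an illustration under stated assumptions: `SketchString` and `sketch_equal` are hypothetical names and do not reflect Tcl's real Tcl_Obj or String intrep layout, and the sketch assumes the flag is set correctly when the intrep is computed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a value with a string rep and an optional
 * unicode intrep; not Tcl's actual data structure. */
typedef struct {
    const char *bytes;              /* string rep (UTF-8 bytes) */
    size_t numBytes;
    const unsigned short *unicode;  /* decoded intrep, or NULL if absent */
    size_t numChars;
    int isCanonical;                /* nonzero iff bytes is the canonical
                                     * encoding of the unicode intrep */
} SketchString;

/* Take the fast track on the intreps only when both values are canonical;
 * otherwise fall back to comparing the string reps byte by byte. */
static int sketch_equal(const SketchString *a, const SketchString *b)
{
    assert(a != NULL && b != NULL);
    if (a->unicode != NULL && b->unicode != NULL
            && a->isCanonical && b->isCanonical) {
        return a->numChars == b->numChars
            && memcmp(a->unicode, b->unicode,
                      a->numChars * sizeof *a->unicode) == 0;
    }
    return a->numBytes == b->numBytes
        && memcmp(a->bytes, b->bytes, a->numBytes) == 0;
}
```

Under this scheme a value built from the bytes `0xC0 0xA1` would carry `isCanonical == 0`, so comparing it with `!` falls through to the byte comparison and yields "not equal" whether or not the intreps have been computed.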

dgp added on 2009-02-04 23:16:37:
Here the identity encoding is simply
a tool for introducing the encoding
variants to be tested.

The core problem is that Tcl has not
made up its mind whether such variants
are to be accepted or rejected.

dkf added on 2009-02-04 23:09:58:
IMO the core problem is the identity encoding.

dgp added on 2009-02-04 22:55:05:
Extended example showing the inconsistency
due to shimmering, and differing definitions
of equality for different intreps:

% set a [encoding convertfrom identity \x21]
!
% set b [encoding convertfrom identity \xc0\xa1]
!
% set s string; # Force direct evaluation - no compile!
string
% $s equal $a $b
0
% string length $a; # Convert to the "string" objType
1
% string length $b
1
% $s equal $a $b
1