Check-in [b39ae9464b]

Overview
Comment: Add a "Rejected Alternatives" section
SHA3-256: b39ae9464b6e9030247de4fa1e0bb6c217fa40f0d5e3671ae2368883293715b8
User & Date: jan.nijtmans 2018-04-04 10:51:43
Context
2018-04-09
07:37
TIP #389 voting now in progress check-in: 65cd82c2e4 user: jan.nijtmans tags: trunk
2018-04-04
10:51
Add a "Rejected Alternatives" section check-in: b39ae9464b user: jan.nijtmans tags: trunk
09:29
Explain why some TIP #389 proposed changes are upwards compatible. Remove description of Tcl_WinUtfToTChar/Tcl_WinTCharToUtf (implementation-only) change. check-in: 40c11941ea user: jan.nijtmans tags: trunk
Changes

Changes to tip/389.md.

Old version (lines 152-166 and 168-178):

    % scan %c \U100000
    1048576  -> (this is the correct Unicode character)
    % string length [string index "a\U100000b" 1]
    2        -> (the Unicode character has length 2)
    % string length [string index "a\U100000b" 2]
    0        -> (So we cannot access the lower surrogate separately)

So, the "string length" of a Unicode character >U+FFFF is 2, and if you try to
split it in two separate characters that won't work: It will then be split
in a character with length 2 (the original one) and another character with
length 0 (the empty string).

Also note that the regexp engine still cannot really handle Unicode characters >U+FFFF,
it will handle those as if they consist of 2 separate characters. Most usage of
regular expressions won't notice the difference.
................................................................................
Those caveats are planned to be handled in "part 2" (TIP #497)

# Reference Implementation

A reference implementation is available in the [tip-389 branch]
(https://core.tcl.tk/tk/timeline?r=tip-389).















# Copyright

This document has been placed in the public domain.







New version (lines 152-166 and 168-192):

    % scan %c \U100000
    1048576  -> (this is the correct Unicode character)
    % string length [string index "a\U100000b" 1]
    2        -> (the Unicode character has length 2)
    % string length [string index "a\U100000b" 2]
    0        -> (So we cannot access the lower surrogate separately)

So, the "string length" of a Unicode character >= **U\+010000** is 2, and if you try to
split it into two separate characters, that won't work: it will be split
into a character with length 2 (the original one) and another character with
length 0 (the empty string).

Also note that the regexp engine still cannot really handle Unicode characters >U+FFFF;
it handles them as if they consisted of 2 separate characters. Most uses of
regular expressions won't notice the difference.
................................................................................
Those caveats are planned to be handled in "part 2" (TIP #497).

# Reference Implementation

A reference implementation is available in the [tip-389 branch]
(https://core.tcl.tk/tk/timeline?r=tip-389).

# Rejected Alternatives

It would have been possible to give the new _Tcl\_GetUniChar_ and friends
a new stub entry and to deprecate the original one, as was done with
_Tcl\_Backslash_. However, _Tcl\_Backslash_ originally only returned
an ASCII character, which needed to be extended to UniChar. UniChars
< **U\+010000** are common in Tcl, whereas Unicode characters >= **U\+010000**
are rare and don't behave well in Tcl 8.6 anyway. Casts from _Tcl\_UniChar_
to int don't cause a warning because every _Tcl\_UniChar_ value fits in the
32-bit int range. On the other hand, casting "char" to _Tcl\_UniChar_
can result in surprising Unicode characters **U\+FF??** if char is
a signed type (as on most platforms). That's why _Tcl\_Backslash_ had
to be handled differently.

# Copyright

This document has been placed in the public domain.