Overview
Comment: | Add a "Rejected Alternatives" section |
---|---|
SHA3-256: | b39ae9464b6e9030247de4fa1e0bb6c2 |
User & Date: | jan.nijtmans 2018-04-04 10:51:43.266 |
Context
2018-04-09
07:37 | TIP #389 voting now in progress (check-in: 65cd82c2e4, user: jan.nijtmans, tags: trunk)
2018-04-04
10:51 | Add a "Rejected Alternatives" section (check-in: b39ae9464b, user: jan.nijtmans, tags: trunk)
09:29 | Explain why some TIP #389 proposed changes are upwards compatible. Remove description of Tcl_WinUtfToTChar/Tcl_WinTCharToUtf (implementation-only) change. (check-in: 40c11941ea, user: jan.nijtmans, tags: trunk)
Changes
Changes to tip/389.md.
︙

    % scan \U100000 %c
    1048576    -> (this is the correct Unicode character)
    % string length [string index "a\U100000b" 1]
    2          -> (the Unicode character has length 2)
    % string length [string index "a\U100000b" 2]
    0          -> (So we cannot access the lower surrogate separately)

So, the "string length" of a Unicode character >= **U\+010000** is 2, and if you try to split it into two separate characters, that won't work: it is split into a character with length 2 (the original one) and another character with length 0 (the empty string).

Also note that the regexp engine still cannot really handle Unicode characters > **U\+FFFF**; it handles them as if they consist of 2 separate characters. Most uses of regular expressions won't notice the difference. These caveats are planned to be handled in "part 2" (TIP #497).

# Reference Implementation

A reference implementation is available in the [tip-389 branch](https://core.tcl.tk/tk/timeline?r=tip-389).

# Rejected Alternatives

It would have been possible to give the new _Tcl\_GetUniChar_ and friends a new stub entry and to deprecate the original one, as was done with _Tcl\_Backslash_. However, _Tcl\_Backslash_ originally returned only an ASCII character, which needed to be extended to UniChar. UniChars < **U\+010000** are common in Tcl; Unicode characters >= **U\+010000** are rare and don't behave well in Tcl 8.6 anyway. Casts from _Tcl\_UniChar_ to int don't cause a warning because all _Tcl\_UniChar_ values fit in the 32-bit int range. On the other hand, casting "char" to _Tcl\_UniChar_ can result in surprising Unicode characters **U\+FF??** if char is a signed type (as on most platforms). That's why _Tcl\_Backslash_ had to be handled differently.

# Copyright

This document has been placed in the public domain.
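
As an illustration of two points in the TIP text above (why a character >= U+010000 has "string length" 2, and why casting a signed char can yield **U+FF??**), here is a minimal C sketch. It assumes a 16-bit _Tcl\_UniChar_; the type `UniChar16` and the functions `widen_naive`, `widen_safe`, `to_surrogates` and `check_examples` are hypothetical names made up for this example and are not part of the Tcl API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit Tcl_UniChar; not taken from tcl.h. */
typedef uint16_t UniChar16;

/* Naive widening: a negative char value wraps modulo 65536,
 * so byte 0xE9 stored in a signed char (-23) becomes 0xFFE9. */
UniChar16 widen_naive(signed char c) {
    return (UniChar16)c;
}

/* Safer widening: convert to unsigned char first,
 * so the result always lies in 0x0000..0x00FF. */
UniChar16 widen_safe(signed char c) {
    return (UniChar16)(unsigned char)c;
}

/* Split a code point in U+010000..U+10FFFF into a UTF-16 surrogate pair.
 * Two 16-bit units are needed, which is why the length is reported as 2. */
void to_surrogates(uint32_t cp, UniChar16 *hi, UniChar16 *lo) {
    uint32_t v = cp - 0x10000u;
    *hi = (UniChar16)(0xD800u + (v >> 10));
    *lo = (UniChar16)(0xDC00u + (v & 0x3FFu));
}

/* Self-check of the cases discussed in the text. */
void check_examples(void) {
    UniChar16 hi, lo;
    assert(widen_naive((signed char)-23) == 0xFFE9);
    assert(widen_safe((signed char)-23) == 0x00E9);
    to_surrogates(0x100000u, &hi, &lo);
    assert(hi == 0xDBC0 && lo == 0xDC00);
}
```

Calling `check_examples()` from a test program exercises all three helpers. The intermediate cast in `widen_safe` is the only difference from the naive version, which suggests why a function returning plain "char" needs more care than one returning _Tcl\_UniChar_.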