Check-in [40c11941ea]

Overview
Comment:Explain why some TIP #389 proposed changes are upwards compatible. Remove description of Tcl_WinUtfToTChar/Tcl_WinTCharToUtf (implementation-only) change.
SHA3-256: 40c11941eac684efde2ce97ce4a229e8e74b255d5339dfa2fa7199ff0dd77fb7
User & Date: jan.nijtmans 2018-04-04 09:29:07
Context
2018-04-04
10:51
Add a "Rejected Alternatives" section check-in: b39ae9464b user: jan.nijtmans tags: trunk
09:29
Explain why some TIP #389 proposed changes are upwards compatible. Remove description of Tcl_WinUtfToTChar/Tcl_WinTCharToUtf (implementation-only) change. check-in: 40c11941ea user: jan.nijtmans tags: trunk
2018-03-30
17:55
New TIP 506 check-in: c80f0474ca user: dgp tags: trunk
Changes

Changes to tip/389.md.

Before (tip/389.md lines 63-76, 85-105, 151-171):

	 > \* _Tcl\_UniCharToLower_

	 > \* _Tcl\_UniCharToTitle_

	 > \* _Tcl\_UniCharToUpper_

	 > \* _Tcl\_GetUniChar_

 * Extend tclUniData.c to include all Unicode 10.0 characters up to
   **U\+02FA20**.  A special case will be made for the functions
   _Tcl\_UniCharIsGraph_ and _Tcl\_UniCharIsPrint_ for the characters in the
   range **U\+0E0100** - **U\+0E01EF**, otherwise it would almost double the
   Unicode table size.

................................................................................
 * If Tcl is compiled with -DTCL\_UTF\_MAX=6, use a different TCL\_STUB\_MAGIC
   value. Since extensions compiled with -DTCL\_UTF\_MAX=6 are binary
   incompatible with normally-compiled Tcl, this prevents extensions compiled
   with that option from being loaded into a normally-compiled Tcl, and vice
   versa. Note that TCL\_UTF\_MAX=6 builds are still not officially supported;
   many additional fixes are needed to make them work correctly.

 * Change the Windows-only functions Tcl\_WinTCharToUtf and Tcl\_WinUtfToTChar
   to use only Win32 API functions in their implementation. This means
   that the Tcl encoding system no longer needs to be initialized in
   order for those functions to work. (This cannot be done with TCL\_UTF\_MAX=3,
   because the Windows API might produce 4-byte UTF-8 sequences, which Tcl
   cannot handle in that configuration.)

# Compatibility

As long as no Surrogates or characters >= **U\+010000** are used, all
functions behave exactly the same as before. The only way that
_Tcl\_UniCharToUtf_ can produce a 4-byte output is when Surrogates or
characters >= **U\+010000** are used.

................................................................................
    0        -> (So we cannot access the lower surrogate separately)

So, the "string length" of a Unicode character >U+FFFF is 2, and trying to
split it into two separate characters won't work: it will be split into a
character with length 2 (the original one) and another character with
length 0 (the empty string).

Also note that the regexp engine still cannot handle Unicode characters >U+FFFF,
this engine will handle those as if they consist of 2 separate characters.


Those caveats are planned to be handled in "part 2" (TIP #497)

# Reference Implementation

A reference implementation is available in the [tip-389 branch]
(https://core.tcl.tk/tk/timeline?r=tip-389).

# Copyright

This document has been placed in the public domain.







After (tip/389.md lines 63-89, 98-111, 157-178):

	 > \* _Tcl\_UniCharToLower_

	 > \* _Tcl\_UniCharToTitle_

	 > \* _Tcl\_UniCharToUpper_

	 > \* _Tcl\_GetUniChar_

	At first sight this looks like a binary incompatibility, but in fact the
	change is upwards compatible. In C, a function generally returns its
	result in a special register (the accumulator). An extension compiled
	against Tcl 8.6 headers expects the accumulator to contain a 16-bit result,
	while the remaining 48 bits (the accumulator is generally 64 bits wide) are
	undefined. When that extension runs under Tcl 8.7, 16 more bits of the
	accumulator are defined (generally all zeros), but the extension still
	reads only the lower 16 bits. The effect is that all characters
	>= **U\+010000** (which Tcl 8.6 does not support) are mapped to characters
	in the first Unicode plane, but that's all. Recompiling the extension
	against Tcl 8.7 headers might enable full Unicode support for the
	extension, if a 32-bit register is used to store the result.

 * Extend tclUniData.c to include all Unicode 10.0 characters up to
   **U\+02FA20**.  A special case will be made for the functions
   _Tcl\_UniCharIsGraph_ and _Tcl\_UniCharIsPrint_ for the characters in the
   range **U\+0E0100** - **U\+0E01EF**, otherwise it would almost double the
   Unicode table size.

................................................................................
 * If Tcl is compiled with -DTCL\_UTF\_MAX=6, use a different TCL\_STUB\_MAGIC
   value. Since extensions compiled with -DTCL\_UTF\_MAX=6 are binary
   incompatible with normally-compiled Tcl, this prevents extensions compiled
   with that option from being loaded into a normally-compiled Tcl, and vice
   versa. Note that TCL\_UTF\_MAX=6 builds are still not officially supported;
   many additional fixes are needed to make them work correctly.
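
The upward-compatibility argument given above for the functions with a widened result can be modelled in a few lines of C. This is only a sketch: `new_style_result` and `seen_by_old_caller` are hypothetical names, the stand-in function performs no real case mapping, and the truncation is simulated with a mask rather than by an actual header mismatch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a Tcl 8.7 function that returns its result
 * as a full int. It is an identity mapping, used only to show the width
 * of the returned value. */
int new_style_result(int ch) {
    assert(ch >= 0 && ch <= 0x10FFFF);
    return ch;
}

/* Model of a caller compiled against Tcl 8.6 headers: it looks only at
 * the low 16 bits of whatever the callee left in the return register. */
uint16_t seen_by_old_caller(int ch) {
    return (uint16_t)(new_style_result(ch) & 0xFFFF);
}
```

Under this model a character in the first plane such as U+00E9 is seen unchanged, while U+01F600 is seen as U+F600, a different character that still lies in the first plane.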

# Compatibility

As long as no Surrogates or characters >= **U\+010000** are used, all
functions behave exactly the same as before. The only way that
_Tcl\_UniCharToUtf_ can produce a 4-byte output is when Surrogates or
characters >= **U\+010000** are used.
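
Why a 4-byte output appears only for surrogate pairs or characters >= U+010000 follows from the UTF-8 encoding rules themselves. The sketch below is a generic encoder, not Tcl's implementation; `combine_surrogates` and `encode_utf8` are hypothetical helper names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine a high and a low surrogate into one code point >= U+010000. */
uint32_t combine_surrogates(uint16_t high, uint16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000u + (((uint32_t)(high - 0xD800) << 10)
                       | (uint32_t)(low - 0xDC00));
}

/* Encode one code point as UTF-8 into out (at least 4 bytes).
 * Returns the number of bytes written: 4 only when cp >= 0x10000. */
size_t encode_utf8(uint32_t cp, unsigned char *out) {
    assert(cp <= 0x10FFFFu);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For example, the surrogate pair 0xD83D, 0xDE00 combines to U+01F600, which encodes as the four bytes F0 9F 98 80, whereas every code point below U+010000 needs at most three bytes.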

................................................................................
    0        -> (So we cannot access the lower surrogate separately)

So, the "string length" of a Unicode character >U+FFFF is 2, and trying to
split it into two separate characters won't work: it will be split into a
character with length 2 (the original one) and another character with
length 0 (the empty string).
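
The length-2 and length-0 behaviour described above can be pictured with a small model of a string stored as 16-bit units. This is an illustrative sketch under that assumption, not the code Tcl uses; `char_units_at` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Number of 16-bit units the "character" at index i occupies in this model:
 * 2 at the first half of a surrogate pair, 0 at the second half (it cannot
 * be accessed separately), and 1 for any other unit. */
size_t char_units_at(const uint16_t *s, size_t n, size_t i) {
    assert(s != NULL);
    if (i >= n) {
        return 0;
    }
    if (is_high_surrogate(s[i]) && i + 1 < n && is_low_surrogate(s[i + 1])) {
        return 2;
    }
    if (is_low_surrogate(s[i]) && i > 0 && is_high_surrogate(s[i - 1])) {
        return 0;
    }
    return 1;
}
```

With the units 0xD83D, 0xDE00, 0x0041, index 0 yields 2, index 1 yields 0, and index 2 yields 1, matching the split into one character of length 2 and one of length 0.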

Also note that the regexp engine still cannot really handle Unicode characters >U+FFFF;
it treats them as if they consisted of 2 separate characters. Most uses of
regular expressions won't notice the difference.

Those caveats are planned to be handled in "part 2" (TIP #497).

# Reference Implementation

A reference implementation is available in the [tip-389 branch]
(https://core.tcl.tk/tk/timeline?r=tip-389).

# Copyright

This document has been placed in the public domain.