Tcl Source Code

Check-in [6e3632ede5]

Overview
Comment: Minor optimization in UTF-8 handling, and add some comments describing how Tcl_UniCharToUtf() handles surrogates.
SHA3-256: 6e3632ede52cca96951cb49c6f6acfbe763273058346c1b99d85735c9da84553
User & Date: jan.nijtmans 2019-03-02 16:52:42
Context
2019-03-02
17:21 Add build with -DTCL_UTF_MAX=6 to travis CI. Also fix 2 gcc compiler-warnings occurring with -DTCL_U... check-in: 9b2a385a0f user: jan.nijtmans tags: core-8-branch
16:53 Merge 8.7 check-in: e766d23655 user: jan.nijtmans tags: trunk
16:52 Minor optimization in UTF-8 handling, and add some comments describing how Tcl_UniCharToUtf() handle... check-in: 6e3632ede5 user: jan.nijtmans tags: core-8-branch
16:35 Fix some "scan.test" test-cases when TCL_UTF_MAX=4. Wrongly resolved merge-conflict in previous che... check-in: 8d1ff82057 user: jan.nijtmans tags: core-8-6-branch
16:08 Merge 8.6 (one forgotten adaptation of surrogate handling, only compiled on Cygwin) check-in: 4f781560c5 user: jan.nijtmans tags: core-8-branch
Changes

Changes to generic/tclScan.c.

Old version (lines 877-892):
	case 'c':
	    /*
	     * Scan a single Unicode character.
	     */

	    offset = TclUtfToUniChar(string, &sch);
	    i = (int)sch;
#if TCL_UTF_MAX == 4
	    if (((sch & 0xFC00) == 0xD800) && (offset < 3)) {
		offset += TclUtfToUniChar(string+offset, &sch);
		i = (((i<<10) & 0x0FFC00) + 0x10000) + (sch & 0x3FF);
	    }
#endif
	    string += offset;
	    if (!(flags & SCAN_SUPPRESS)) {
		objPtr = Tcl_NewWideIntObj(i);

New version (lines 877-892; lines 883 and 884 changed):
	case 'c':
	    /*
	     * Scan a single Unicode character.
	     */

	    offset = TclUtfToUniChar(string, &sch);
	    i = (int)sch;
#if TCL_UTF_MAX <= 4
	    if ((sch >= 0xD800) && (offset < 3)) {
		offset += TclUtfToUniChar(string+offset, &sch);
		i = (((i<<10) & 0x0FFC00) + 0x10000) + (sch & 0x3FF);
	    }
#endif
	    string += offset;
	    if (!(flags & SCAN_SUPPRESS)) {
		objPtr = Tcl_NewWideIntObj(i);

Changes to generic/tclUtf.c.

Old version (lines 108-121 and 265-283):
 *---------------------------------------------------------------------------
 *
 * Tcl_UniCharToUtf --
 *
 *	Store the given Tcl_UniChar as a sequence of UTF-8 bytes in the
 *	provided buffer. Equivalent to Plan 9 runetochar().
 *
 * Results:
 *	The return values is the number of bytes in the buffer that were
 *	consumed.
 *
 * Side effects:
 *	None.
 *
................................................................................
 *	The caller must ensure that the source buffer is long enough that this
 *	routine does not run off the end and dereference non-existent memory
 *	looking for trail bytes. If the source buffer is known to be '\0'
 *	terminated, this cannot happen. Otherwise, the caller should call
 *	Tcl_UtfCharComplete() before calling this routine to ensure that
 *	enough bytes remain in the string.
 *
 *	If TCL_UTF_MAX == 4, special handling of Surrogate pairs is done:
 *	For any UTF-8 string containing a character outside of the BMP, the
 *	first call to this function will fill *chPtr with the high surrogate
 *	and generate a return value of 0. Calling Tcl_UtfToUniChar again
 *	will produce the low surrogate and a return value of 4. Because *chPtr
 *	is used to remember whether the high surrogate is already produced, it
 *	is recommended to initialize the variable it points to as 0 before
 *	the first call to Tcl_UtfToUniChar is done.
 *
 * Results:
 *	*chPtr is filled with the Tcl_UniChar, and the return value is the
 *	number of bytes from the UTF-8 string that were consumed.

New version (lines 108-134 and 278-296; lines 115-127 added, lines 285, 288 and 289 changed):
 *---------------------------------------------------------------------------
 *
 * Tcl_UniCharToUtf --
 *
 *	Store the given Tcl_UniChar as a sequence of UTF-8 bytes in the
 *	provided buffer. Equivalent to Plan 9 runetochar().
 *
 *	Special handling of Surrogate pairs is handled as follows:
 *	When this function is called for ch being a high surrogate,
 *	the first byte of the 4-byte UTF-8 sequence is produced and
 *	the function returns 1. Calling the function again with a
 *	low surrogate, the remaining 3 bytes of the 4-byte UTF-8
 *	sequence is produced, and the function returns 3. The buffer
 *	is used to remember the high surrogate between the two calls.
 *
 *	If no low surrogate follows the high surrogate (which is actually
 *	illegal), this can be handled reasonably by calling Tcl_UniCharToUtf
 *	again with ch = -1. This will produce a 3-byte UTF-8 sequence
 *	representing the high surrogate.
 *
 * Results:
 *	The return values is the number of bytes in the buffer that were
 *	consumed.
 *
 * Side effects:
 *	None.
 *
................................................................................
 *	The caller must ensure that the source buffer is long enough that this
 *	routine does not run off the end and dereference non-existent memory
 *	looking for trail bytes. If the source buffer is known to be '\0'
 *	terminated, this cannot happen. Otherwise, the caller should call
 *	Tcl_UtfCharComplete() before calling this routine to ensure that
 *	enough bytes remain in the string.
 *
 *	Special handling of Surrogate pairs is handled as follows:
 *	For any UTF-8 string containing a character outside of the BMP, the
 *	first call to this function will fill *chPtr with the high surrogate
 *	and generate a return value of 1. Calling Tcl_UtfToUniChar again
 *	will produce the low surrogate and a return value of 3. Because *chPtr
 *	is used to remember whether the high surrogate is already produced, it
 *	is recommended to initialize the variable it points to as 0 before
 *	the first call to Tcl_UtfToUniChar is done.
 *
 * Results:
 *	*chPtr is filled with the Tcl_UniChar, and the return value is the
 *	number of bytes from the UTF-8 string that were consumed.