Overview
Comment: | Minor optimization in UTF-8 handling, and add some comments describing how Tcl_UniCharToUtf() handles surrogates. |
---|---|
SHA3-256: | 6e3632ede52cca96951cb49c6f6acfbe |
User & Date: | jan.nijtmans 2019-03-02 16:52:42.665 |
Context
2019-03-02
17:21 | Add build with -DTCL_UTF_MAX=6 to travis CI. Also fix 2 gcc compiler-warnings occurring with -DTCL_U... check-in: 9b2a385a0f user: jan.nijtmans tags: core-8-branch | |
16:53 | Merge 8.7 check-in: e766d23655 user: jan.nijtmans tags: trunk | |
16:52 | Minor optimization in UTF-8 handling, and add some comments describing how Tcl_UniCharToUtf() handle... check-in: 6e3632ede5 user: jan.nijtmans tags: core-8-branch | |
16:35 | Fix some "scan.test" test-cases when TCL_UTF_MAX=4. Wrongly resolved merge-conflict in previous che... check-in: 8d1ff82057 user: jan.nijtmans tags: core-8-6-branch | |
16:08 | Merge 8.6 (one forgotten adaptation of surrogate handling, only compiled on Cygwin) check-in: 4f781560c5 user: jan.nijtmans tags: core-8-branch | |
Changes
Changes to generic/tclScan.c.
︙
Resulting code:

        case 'c':
            /*
             * Scan a single Unicode character.
             */

            offset = TclUtfToUniChar(string, &sch);
            i = (int)sch;
    #if TCL_UTF_MAX <= 4
            if ((sch >= 0xD800) && (offset < 3)) {
                offset += TclUtfToUniChar(string+offset, &sch);
                i = (((i<<10) & 0x0FFC00) + 0x10000) + (sch & 0x3FF);
            }
    #endif
            string += offset;
            if (!(flags & SCAN_SUPPRESS)) {
                objPtr = Tcl_NewWideIntObj(i);
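The arithmetic in the `#if TCL_UTF_MAX <= 4` branch combines a high and a low surrogate into a single code point. The standalone sketch below isolates that expression so it can be checked by hand; the function name `combine_surrogates` is hypothetical and does not appear in the Tcl sources.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of Tcl): combine a UTF-16 surrogate pair
 * into one code point, using the same expression as the scan code above.
 */
int combine_surrogates(int high, int low)
{
    /* Assumption for this sketch: the caller passes a well-formed pair. */
    assert((high & 0xFC00) == 0xD800);
    assert((low & 0xFC00) == 0xDC00);

    /*
     * (high << 10) & 0x0FFC00 keeps the 10 payload bits of the high
     * surrogate, shifted into bits 10..19. Adding 0x10000 moves the value
     * past the BMP, and the low 10 bits come from the low surrogate.
     */
    return (((high << 10) & 0x0FFC00) + 0x10000) + (low & 0x3FF);
}
```

For example, the pair 0xD83D, 0xDE00 yields 0x1F600: the masked shift gives 0xF400, adding 0x10000 gives 0x1F400, and the low surrogate contributes 0x200.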
︙
Changes to generic/tclUtf.c.
︙
Resulting comment:

     *---------------------------------------------------------------------------
     *
     * Tcl_UniCharToUtf --
     *
     *      Store the given Tcl_UniChar as a sequence of UTF-8 bytes in the
     *      provided buffer. Equivalent to Plan 9 runetochar().
     *
     *      Special handling of Surrogate pairs is handled as follows:
     *      When this function is called for ch being a high surrogate,
     *      the first byte of the 4-byte UTF-8 sequence is produced and
     *      the function returns 1. Calling the function again with a
     *      low surrogate, the remaining 3 bytes of the 4-byte UTF-8
     *      sequence is produced, and the function returns 3. The buffer
     *      is used to remember the high surrogate between the two calls.
     *
     *      If no low surrogate follows the high surrogate (which is actually
     *      illegal), this can be handled reasonably by calling Tcl_UniCharToUtf
     *      again with ch = -1. This will produce a 3-byte UTF-8 sequence
     *      representing the high surrogate.
     *
     * Results:
     *      The return values is the number of bytes in the buffer that were
     *      consumed.
     *
     * Side effects:
     *      None.
     *
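The two-call protocol described in this comment can be illustrated with a small standalone sketch. It is not the Tcl implementation: the function names `sketch_surrogate_to_utf` and `sketch_encode_pair` are hypothetical, the bit layout used to stash the high surrogate in the buffer is an illustrative assumption, and the `ch = -1` recovery path is left out.

```c
#include <assert.h>

/*
 * Hypothetical sketch (not Tcl's code). Emits one UTF-16 surrogate into buf.
 * High surrogate: writes byte 1 of the 4-byte sequence, parks the remaining
 * high-surrogate bits in buf[1] and buf[2], and returns 1.
 * Low surrogate: buf must point at byte 2 of the sequence (the caller
 * advanced by the previous return value); completes bytes 2..4, returns 3.
 */
int sketch_surrogate_to_utf(int ch, unsigned char *buf)
{
    assert((ch & 0xF800) == 0xD800);    /* this sketch handles surrogates only */

    if ((ch & 0x0400) == 0) {
        /* High surrogate. Adding 0x40 accounts for the 0x10000 offset. */
        ch += 0x40;
        buf[0] = (unsigned char)(0xF0 | ((ch >> 8) & 0x07));
        buf[1] = (unsigned char)(0x80 | ((ch >> 2) & 0x3F));
        buf[2] = (unsigned char)((ch << 4) & 0x30);   /* partial third byte */
        return 1;
    }

    /* Low surrogate: finish the partially written bytes. */
    buf[1] = (unsigned char)(buf[1] | 0x80 | ((ch >> 6) & 0x0F));
    buf[2] = (unsigned char)(0x80 | (ch & 0x3F));
    return 3;
}

/* Hypothetical driver: encode a pair with two calls; out needs 4 bytes. */
int sketch_encode_pair(int high, int low, unsigned char *out)
{
    int n = sketch_surrogate_to_utf(high, out);
    n += sketch_surrogate_to_utf(low, out + n);
    return n;
}
```

Under these assumptions, encoding the pair 0xD83D, 0xDE00 produces the four bytes F0 9F 98 80, with the two calls returning 1 and 3 as the comment describes.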
︙
Resulting comment:

     *      The caller must ensure that the source buffer is long enough that this
     *      routine does not run off the end and dereference non-existent memory
     *      looking for trail bytes. If the source buffer is known to be '\0'
     *      terminated, this cannot happen. Otherwise, the caller should call
     *      Tcl_UtfCharComplete() before calling this routine to ensure that
     *      enough bytes remain in the string.
     *
     *      Special handling of Surrogate pairs is handled as follows:
     *      For any UTF-8 string containing a character outside of the BMP, the
     *      first call to this function will fill *chPtr with the high surrogate
     *      and generate a return value of 1. Calling Tcl_UtfToUniChar again
     *      will produce the low surrogate and a return value of 3. Because *chPtr
     *      is used to remember whether the high surrogate is already produced, it
     *      is recommended to initialize the variable it points to as 0 before
     *      the first call to Tcl_UtfToUniChar is done.
     *
     * Results:
     *      *chPtr is filled with the Tcl_UniChar, and the return value is the
     *      number of bytes from the UTF-8 string that were consumed.
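The decoding direction can be sketched the same way. The code below is a simplified illustration, not the Tcl implementation: the function name `sketch_utf_to_surrogate` is hypothetical, input validation is omitted, and 2-byte and 3-byte sequences are not handled. It shows why the variable behind `chPtr` should start at 0: the second call inspects it to decide whether a high surrogate is pending.

```c
#include <assert.h>

/*
 * Hypothetical sketch (not Tcl's code). Decodes a 4-byte UTF-8 sequence in
 * two calls: the first returns 1 and stores the high surrogate, the second
 * (called on the following byte) returns 3 and stores the low surrogate.
 */
int sketch_utf_to_surrogate(const unsigned char *src, unsigned short *chPtr)
{
    assert(src && chPtr);

    if ((src[0] & 0xF8) == 0xF0) {
        /* Lead byte of a 4-byte sequence: take the top bits of the code
         * point and subtract 0x40 to remove the 0x10000 offset. */
        int top = (((src[0] & 0x07) << 8)
                | ((src[1] & 0x3F) << 2)
                | ((src[2] & 0x3F) >> 4)) - 0x40;
        *chPtr = (unsigned short)(0xD800 + top);
        return 1;
    }
    if (((src[0] & 0xC0) == 0x80) && ((*chPtr & 0xFC00) == 0xD800)) {
        /* Trail byte while a high surrogate is pending in *chPtr:
         * the low 10 bits come from the last two bytes. */
        *chPtr = (unsigned short)(0xDC00 + ((src[1] & 0x0F) << 6)
                + (src[2] & 0x3F));
        return 3;
    }
    /* Anything else is treated as a single byte in this sketch. */
    *chPtr = src[0];
    return 1;
}
```

With the bytes F0 9F 98 80 and the state variable initialized to 0, the first call stores 0xD83D and returns 1; the second call, started one byte further on, stores 0xDE00 and returns 3.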
︙