Tcl Source Code

Check-in [b02df08680]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Comment Comment Tcl_UniCharToUtf() better, what happens handling surrogates. Add type cast in tclUtf.c, making actual check clearer
Downloads: Tarball | ZIP archive
Timelines: family | ancestors | descendants | both | core-8-branch
Files: files | file ages | folders
SHA3-256: b02df08680dba410413bbdba9cd6643ab6caefb82e1efca737df581d8f5086c3
User & Date: jan.nijtmans 2019-03-18 22:32:37.925
Context
2019-03-20
22:45
Fix Tcl_UtfToUniCharDString() function, handling invalid byte at the end of the string: Not quite co... check-in: 3e8ada19f5 user: jan.nijtmans tags: core-8-branch
2019-03-18
22:34
Merge 8.7 check-in: 48c676b649 user: jan.nijtmans tags: trunk
22:32
Comment Comment Tcl_UniCharToUtf() better, what happens handling surrogates. Add type cast in tclUtf... check-in: b02df08680 user: jan.nijtmans tags: core-8-branch
22:14
enlarge a few small buffers, which could overflow using Unicode characters > /UFFFF. Eliminate some... check-in: 41c373ad8f user: jan.nijtmans tags: core-8-branch
Changes
Unified Diff Ignore Whitespace Patch
Changes to doc/Utf.3.
129
130
131
132
133
134
135
136
137
138

139
140
141
142
143
144
145
.PP
\fBTCL_UTF_MAX\fR is the maximum number of bytes that it takes to
represent one Unicode character in the UTF-8 representation.
.PP
\fBTcl_UniCharToUtf\fR stores the character \fIch\fR as a UTF-8 string
in starting at \fIbuf\fR.  The return value is the number of bytes stored
in \fIbuf\fR. If ch is a high surrogate (range U+D800 - U+DBFF), then
the return value will be 0 and nothing will be stored. If you still
want to produce UTF-8 output for it (even though knowing it's an illegal
code-point on its own), just call \fBTcl_UniCharToUtf\fR again using ch = -1.

.PP
\fBTcl_UtfToUniChar\fR reads one UTF-8 character starting at \fIsrc\fR
and stores it as a Tcl_UniChar in \fI*chPtr\fR.  The return value is the
number of bytes read from \fIsrc\fR.  The caller must ensure that the
source buffer is long enough such that this routine does not run off the
end and dereference non-existent or random memory; if the source buffer
is known to be null-terminated, this will not happen.  If the input is







|
|
|
>







129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
.PP
\fBTCL_UTF_MAX\fR is the maximum number of bytes that it takes to
represent one Unicode character in the UTF-8 representation.
.PP
\fBTcl_UniCharToUtf\fR stores the character \fIch\fR as a UTF-8 string
in starting at \fIbuf\fR.  The return value is the number of bytes stored
in \fIbuf\fR. If ch is a high surrogate (range U+D800 - U+DBFF), then
the return value will be 1 and a single byte in the range 0xF0 - 0xF4
will be stored. If you still want to produce UTF-8 output for it (even
though knowing it's an illegal code-point on its own), just call
\fBTcl_UniCharToUtf\fR again specifying ch = -1.
.PP
\fBTcl_UtfToUniChar\fR reads one UTF-8 character starting at \fIsrc\fR
and stores it as a Tcl_UniChar in \fI*chPtr\fR.  The return value is the
number of bytes read from \fIsrc\fR.  The caller must ensure that the
source buffer is long enough such that this routine does not run off the
end and dereference non-existent or random memory; if the source buffer
is known to be null-terminated, this will not happen.  If the input is
Changes to generic/tclScan.c.
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
				 * required. */
{
    int gotXpg, gotSequential, value, i, flags;
    char *end;
    Tcl_UniChar ch = 0;
    int objIndex, xpgSize, nspace = numVars;
    int *nassign = TclStackAlloc(interp, nspace * sizeof(int));
    char buf[TCL_UTF_MAX+1] = "";
    Tcl_Obj *errorMsg;		/* Place to build an error messages. Note that
				 * these are messy operations because we do
				 * not want to use the formatting engine;
				 * we're inside there! */

    /*
     * Initialize an array that records the number of times a variable is







|







257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
				 * required. */
{
    int gotXpg, gotSequential, value, i, flags;
    char *end;
    Tcl_UniChar ch = 0;
    int objIndex, xpgSize, nspace = numVars;
    int *nassign = TclStackAlloc(interp, nspace * sizeof(int));
    char buf[TCL_UTF_MAX + 1] = "";
    Tcl_Obj *errorMsg;		/* Place to build an error messages. Note that
				 * these are messy operations because we do
				 * not want to use the formatting engine;
				 * we're inside there! */

    /*
     * Initialize an array that records the number of times a variable is
Changes to generic/tclUtf.c.
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
		/* produce high surrogate, advance source pointer */
		*chPtr = 0xD800 + high;
		return 1;
	    }
#else
	    *chPtr = (((byte & 0x07) << 18) | ((src[1] & 0x3F) << 12)
		    | ((src[2] & 0x3F) << 6) | (src[3] & 0x3F));
	    if ((*chPtr - 0x10000) <= 0xFFFFF) {
		return 4;
	    }
#endif
	}

	/*
	 * A four-byte-character lead-byte not followed by two trail-bytes







|







442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
		/* produce high surrogate, advance source pointer */
		*chPtr = 0xD800 + high;
		return 1;
	    }
#else
	    *chPtr = (((byte & 0x07) << 18) | ((src[1] & 0x3F) << 12)
		    | ((src[2] & 0x3F) << 6) | (src[3] & 0x3F));
	    if ((unsigned)(*chPtr - 0x10000) <= 0xFFFFF) {
		return 4;
	    }
#endif
	}

	/*
	 * A four-byte-character lead-byte not followed by two trail-bytes