Author: Jan Nijtmans <[email protected]>
State: Final
Type: Project
Vote: Done
Tcl-Version: 8.7
Tcl-Branch: tip-575
Vote-Summary: Accepted 3/0/0
Votes-For: JN, KW, SL
Votes-Against: none
Votes-Present: none
Abstract
This TIP is a successor to TIP #542, resolving a corner case that was not recognized at the time.
TIP #542 allows stub-enabled extensions to be compiled with -DTCL_UTF_MAX=4 while
still working with a Tcl compiled with -DTCL_UTF_MAX=3. All functions that use a
Tcl_UniChar (e.g. Tcl_UtfToUniChar) change behavior with the value of TCL_UTF_MAX: if
TCL_UTF_MAX=3, then supplying a 4-byte UTF-8 character to this function will return 1, and
*chPtr will hold a high surrogate. If TCL_UTF_MAX=4, then this function will return 4, and
*chPtr will hold the full Unicode character. This works by supplying two different stub
entries and switching between them based on the value of TCL_UTF_MAX.
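The two behaviors described above can be illustrated with a small stand-alone sketch. It does not call Tcl: sketch_utf_to_unichar() is a hypothetical stand-in for Tcl_UtfToUniChar(), its utf_max argument plays the role of the compile-time TCL_UTF_MAX value, and it assumes a well-formed 4-byte sequence as input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Tcl_UtfToUniChar(); NOT the Tcl implementation.
 * utf_max models the compile-time TCL_UTF_MAX value.
 * Assumption: src points at a well-formed 4-byte UTF-8 sequence. */
static int sketch_utf_to_unichar(const unsigned char *src,
                                 unsigned int *chPtr, int utf_max)
{
    unsigned int cp;

    assert(src != NULL && chPtr != NULL);

    /* Assemble the 21-bit code point from the four bytes. */
    cp = ((unsigned int)(src[0] & 0x07) << 18)
       | ((unsigned int)(src[1] & 0x3F) << 12)
       | ((unsigned int)(src[2] & 0x3F) << 6)
       |  (unsigned int)(src[3] & 0x3F);

    if (utf_max >= 4) {
        /* TCL_UTF_MAX=4: consume all 4 bytes, store the full character. */
        *chPtr = cp;
        return 4;
    }
    /* TCL_UTF_MAX=3: report 1 byte and store only the high surrogate. */
    *chPtr = 0xD800 + ((cp - 0x10000) >> 10);
    return 1;
}
```

For the bytes F0 9F 98 80 (U+1F600), the sketch yields 4 and 0x1F600 with utf_max 4, and 1 and the high surrogate 0xD83D with utf_max 3.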
The functions Tcl_UtfNext()/Tcl_UtfPrev() don't have a Tcl_UniChar parameter, but there is
still an expected coupling with the function Tcl_UtfToUniChar: if TCL_UTF_MAX=4, we would
expect Tcl_UtfNext() to be able to jump forward 4 bytes, while with TCL_UTF_MAX=3,
Tcl_UtfNext() can jump forward at most 3 bytes. The same holds for Tcl_UtfPrev().
The function Tcl_UtfCharComplete(), which is coupled with the function Tcl_UtfToUniChar
(it indicates whether enough bytes are available for Tcl_UtfToUniChar() to be called),
has the same problem. Making this function switchable has the advantage that it
can now also be used to protect calls to Tcl_UtfNext(), for extensions compiled with
any value of TCL_UTF_MAX.
Specification
Implement new functions Tcl_UtfCharComplete()/Tcl_UtfNext()/Tcl_UtfPrev(), which can
jump 4 bytes forward or backward, respectively, so it is possible to jump over UTF-8
characters > U+FFFF in one step instead of two.
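The "one step instead of two" difference can be sketched as follows. This is a simplified model, not Tcl's code: sketch_utf_next() and sketch_count_steps() are hypothetical names, max_step stands for the largest jump the function variant supports, and the two-step behavior of the 3-byte variant (one byte for the lead, then the three trail bytes together) is an assumption chosen to match the description above. Input is assumed to be well-formed and NUL-terminated.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_IS_TRAIL(b) (((b) & 0xC0) == 0x80)

/* Hypothetical model of a Tcl_UtfNext()-style step; NOT the Tcl
 * implementation. max_step is 3 or 4 (the largest supported jump). */
static const unsigned char *sketch_utf_next(const unsigned char *src,
                                            int max_step)
{
    unsigned char lead;

    assert(src != NULL && *src != 0);
    lead = src[0];

    if (lead < 0x80) return src + 1;                  /* ASCII */
    if (SKETCH_IS_TRAIL(lead)) {
        /* Assumed second half-step of the 3-byte variant: skip the
         * three trail bytes left over from a 4-byte sequence. */
        if (max_step < 4 && SKETCH_IS_TRAIL(src[1])
                && SKETCH_IS_TRAIL(src[2])) {
            return src + 3;
        }
        return src + 1;
    }
    if (lead < 0xE0) return src + 2;                  /* 2-byte sequence */
    if (lead < 0xF0) return src + 3;                  /* 3-byte sequence */
    if (lead <= 0xF5 && max_step >= 4) return src + 4; /* 4-byte sequence */
    return src + 1;                                   /* lead byte alone */
}

/* Count how many steps are needed to reach the terminating NUL. */
static int sketch_count_steps(const unsigned char *s, int max_step)
{
    int steps = 0;

    while (*s != 0) {
        s = sketch_utf_next(s, max_step);
        steps++;
    }
    return steps;
}
```

Under these assumptions, crossing the single character U+1F600 (bytes F0 9F 98 80) takes one step with max_step 4 and two steps with max_step 3, while text within the Basic Multilingual Plane takes the same number of steps either way.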
The new Tcl_UtfNext()/Tcl_UtfPrev() functions will get their own new entries in the
stub table. So, extensions (however rare) using Tcl_UtfNext()/Tcl_UtfPrev() but
compiled against Tcl 8.6 headers will keep their original behavior.
The new Tcl_UtfCharComplete() will behave almost identically to the old one. The only
difference is when it encounters a starting byte between 0xF0 and 0xF5: then it will return
true only when at least 4 bytes are available.
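A minimal sketch of this rule, for illustration only. sketch_utf_char_complete() is a hypothetical stand-in, not Tcl's function; its four_byte_aware flag selects between the behavior described in this TIP and an assumed old behavior in which a starting byte of 0xF0 or above is treated as a single byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Tcl_UtfCharComplete(); NOT the Tcl
 * implementation. Returns 1 if the 'length' bytes at 'src' hold at least
 * one complete character, else 0.
 * four_byte_aware != 0 models the new behavior described in this TIP. */
static int sketch_utf_char_complete(const char *src, size_t length,
                                    int four_byte_aware)
{
    unsigned char lead;
    size_t need;

    assert(src != NULL);
    if (length == 0) {
        return 0;
    }
    lead = (unsigned char)src[0];

    if (lead < 0xC0) {
        need = 1;               /* ASCII or a stray trail byte */
    } else if (lead < 0xE0) {
        need = 2;
    } else if (lead < 0xF0) {
        need = 3;
    } else if (lead <= 0xF5 && four_byte_aware) {
        need = 4;               /* the only case that changes */
    } else {
        need = 1;               /* assumed: handled as a single byte */
    }
    return length >= need;
}
```

With three of the four bytes of U+1F600 available, the 4-byte-aware variant reports "incomplete", while the assumed old variant reports "complete" because it only looks at the first byte.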
If an extension is compiled with -DTCL_UTF_MAX=4 or with -DTCL_NO_DEPRECATED, then
Tcl_UtfCharComplete() will behave as described in this TIP; otherwise it
will behave exactly as in Tcl 8.6.
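The selection rule can be pictured as a preprocessor condition. The sketch below is only an illustration of that condition: SKETCH_NEW_CHAR_COMPLETE and sketch_uses_new_char_complete() are hypothetical names, not part of Tcl's headers, and the fallback definition of TCL_UTF_MAX merely simulates an extension built with -DTCL_UTF_MAX=4.

```c
#include <assert.h>

/* Simulate an extension built with -DTCL_UTF_MAX=4 if no value was given. */
#ifndef TCL_UTF_MAX
#define TCL_UTF_MAX 4
#endif

/* SKETCH_NEW_CHAR_COMPLETE is a hypothetical name, not a Tcl macro. */
#if defined(TCL_NO_DEPRECATED) || TCL_UTF_MAX > 3
#define SKETCH_NEW_CHAR_COMPLETE 1   /* behavior described in this TIP */
#else
#define SKETCH_NEW_CHAR_COMPLETE 0   /* behavior exactly as in Tcl 8.6 */
#endif

/* Report which variant this translation unit would be bound to. */
static int sketch_uses_new_char_complete(void)
{
    /* Only the two values discussed in this TIP are considered. */
    assert(TCL_UTF_MAX == 3 || TCL_UTF_MAX == 4);
    return SKETCH_NEW_CHAR_COMPLETE;
}
```

Building the same file with -DTCL_UTF_MAX=3 and without -DTCL_NO_DEPRECATED would make the function return 0 instead.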
The documentation for Tcl_UtfCharComplete() is adapted to state that this function
can now be used to protect Tcl_UtfNext() calls too.
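The guarding pattern might look roughly like the loop below, which walks a buffer that may end in the middle of a character. It is a self-contained sketch, not Tcl code: sketch_seq_len() and sketch_count_complete_chars() are hypothetical helpers, the length check plays the role of a Tcl_UtfCharComplete() call, and the pointer advance plays the role of Tcl_UtfNext(), both in their 4-byte-aware form.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for the sequence that starts with 'lead'
 * (hypothetical helper, 4-byte-aware variant). */
static size_t sketch_seq_len(unsigned char lead)
{
    if (lead < 0xC0) return 1;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    if (lead <= 0xF5) return 4;
    return 1;
}

/* Count the complete characters in buf[0..len) and report in *consumed
 * how many bytes they occupy; a truncated trailing sequence is left alone. */
static size_t sketch_count_complete_chars(const char *buf, size_t len,
                                          size_t *consumed)
{
    size_t pos = 0, count = 0;

    assert(buf != NULL && consumed != NULL);
    while (pos < len) {
        size_t need = sketch_seq_len((unsigned char)buf[pos]);

        if (len - pos < need) {
            break;              /* guard: character is not complete */
        }
        pos += need;            /* safe to step over the whole character */
        count++;
    }
    *consumed = pos;
    return count;
}
```

Because the guard and the step agree on the length of a 4-byte sequence, the loop never steps past the end of the buffer; the unconsumed tail can be kept until more bytes arrive.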
Implementation
The implementation is in branch tip-575.
Compatibility
As long as Tcl and the extension are both compiled with -DTCL_UTF_MAX=3 (which is
the default in Tcl 8.x) or both with -DTCL_UTF_MAX=4 (as in Tcl 9.x), nothing changes.
The difference can only be noticed in extensions that are compiled with a different
TCL_UTF_MAX value than Tcl.
Copyright
This document has been placed in the public domain.