TIP 575: Switchable Tcl_UtfCharComplete()/Tcl_UtfNext()/Tcl_UtfPrev()

Author:         Jan Nijtmans <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Tcl-Version:    8.7
Tcl-Branch:     tip-575
Vote-Summary:   Accepted 3/0/0
Votes-For:      JN, KW, SL
Votes-Against:  none
Votes-Present:  none

Abstract

This TIP is a successor to TIP #542. It resolves a corner case that was not recognized when TIP #542 was written.

TIP #542 allows stub-enabled extensions to be compiled with -DTCL_UTF_MAX=4 while still working with a Tcl compiled with -DTCL_UTF_MAX=3. All functions that use a Tcl_UniChar (e.g. Tcl_UtfToUniChar) change behavior with the value of TCL_UTF_MAX. If TCL_UTF_MAX=3, supplying a 4-byte UTF-8 character to Tcl_UtfToUniChar makes it return 1 and store a high surrogate in *chPtr. If TCL_UTF_MAX=4, the function returns 4 and stores the full Unicode character in *chPtr. This works by supplying two different stub entries and selecting between them according to the value of TCL_UTF_MAX.
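The two behaviors can be illustrated with a minimal standalone sketch. It is not Tcl's implementation: demo_decode4 and its utfMax parameter are hypothetical names, the function handles only 4-byte lead bytes, and it models only the first call on such a sequence.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of Tcl: decode the 4-byte UTF-8 sequence at
 * src the way the text describes for each TCL_UTF_MAX setting. Returns the
 * number of bytes consumed and stores the result in *chPtr.
 */
static int demo_decode4(const unsigned char *src, int utfMax,
                        unsigned int *chPtr)
{
    unsigned int cp;

    assert(src != NULL && chPtr != NULL);
    assert((src[0] & 0xF8) == 0xF0);    /* sketch handles 4-byte leads only */

    /* Assemble the code point from the payload bits of the four bytes. */
    cp = ((unsigned int)(src[0] & 0x07) << 18)
       | ((unsigned int)(src[1] & 0x3F) << 12)
       | ((unsigned int)(src[2] & 0x3F) << 6)
       |  (unsigned int)(src[3] & 0x3F);

    if (utfMax >= 4) {
        *chPtr = cp;                    /* full code point, 4 bytes consumed */
        return 4;
    }
    /* utfMax == 3: store only the high surrogate and consume 1 byte. */
    *chPtr = 0xD800 + ((cp - 0x10000) >> 10);
    return 1;
}
```

For the bytes F0 9F 98 80 (U+1F600), the sketch yields 4 and 0x1F600 with utfMax 4, and 1 and the high surrogate 0xD83D with utfMax 3.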

The functions Tcl_UtfNext()/Tcl_UtfPrev() don't have a Tcl_UniChar parameter, but they are still expected to be consistent with Tcl_UtfToUniChar(): with TCL_UTF_MAX=4 we would expect Tcl_UtfNext() to be able to jump forward 4 bytes, while with TCL_UTF_MAX=3 it can jump forward at most 3 bytes. The same holds for Tcl_UtfPrev(), moving backward.

The function Tcl_UtfCharComplete() is coupled with Tcl_UtfToUniChar() as well (it indicates whether enough bytes are available for Tcl_UtfToUniChar() to be called), so it has the same problem. Making this function switchable has the added advantage that it can now be used to protect calls to Tcl_UtfNext() too, in extensions compiled with either value of TCL_UTF_MAX.

Specification

Implement new versions of Tcl_UtfCharComplete()/Tcl_UtfNext()/Tcl_UtfPrev() that are aware of 4-byte sequences. Tcl_UtfNext() and Tcl_UtfPrev() can then jump 4 bytes forward and back, respectively, so a UTF-8 character > U+FFFF can be skipped in one step instead of two.

The new Tcl_UtfNext()/Tcl_UtfPrev() functions get their own new entries in the stub table, so extensions (however rare) that use Tcl_UtfNext()/Tcl_UtfPrev() but are compiled against Tcl 8.6 headers keep their original behavior.

The new Tcl_UtfCharComplete() behaves almost identically to the old one. The only difference is for a starting byte between 0xF0 and 0xF5: the new function returns true only when at least 4 bytes are available.
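The difference can be sketched in standalone C. The names demo_needed and demo_char_complete, and the fourByteAware flag, are hypothetical. The lead-byte classification is simplified, and the sketch assumes the old variant asks for only 1 byte after a 0xF0..0xF5 lead byte, which the text above implies but does not spell out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes that a sequence starting with `lead` needs. */
static int demo_needed(unsigned char lead, int fourByteAware)
{
    if (lead < 0xC0) return 1;      /* ASCII or stray continuation byte */
    if (lead < 0xE0) return 2;      /* 2-byte lead */
    if (lead < 0xF0) return 3;      /* 3-byte lead */
    if (lead <= 0xF5) {
        /* The only case where old and new differ (old value is assumed). */
        return fourByteAware ? 4 : 1;
    }
    return 1;                       /* 0xF6..0xFF: treated as a single byte */
}

/* Returns 1 if `length` bytes at src hold a complete sequence, else 0. */
static int demo_char_complete(const char *src, int length, int fourByteAware)
{
    assert(src != NULL);
    if (length < 1) {
        return 0;
    }
    return length >= demo_needed((unsigned char)src[0], fourByteAware);
}
```

With three bytes of a 4-byte sequence available, the 4-byte-aware variant reports the character as incomplete while the old-style variant reports it as complete. Both agree for shorter sequences.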

If an extension is compiled with -DTCL_UTF_MAX=4 or with -DTCL_NO_DEPRECATED, Tcl_UtfCharComplete() behaves as described in this TIP; otherwise it behaves exactly as in Tcl 8.6.
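One way such a compile-time switch could look is sketched below. This is an illustration only, not the contents of Tcl's headers: DEMO_UTF_MAX, DEMO_NO_DEPRECATED, Demo_UtfCharComplete and the two demo_complete_* functions are hypothetical stand-ins, and the old variant's handling of lead bytes 0xF0 and above is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#ifndef DEMO_UTF_MAX
#   define DEMO_UTF_MAX 3           /* hypothetical stand-in for TCL_UTF_MAX */
#endif

/* Shared part: bytes needed for lead bytes below 0xF0. */
static int demo_needed_below_f0(unsigned char lead)
{
    if (lead < 0xC0) return 1;
    if (lead < 0xE0) return 2;
    return 3;
}

/* Old-style check: assumed to ask for 1 byte for lead bytes >= 0xF0. */
static int demo_complete_old(const char *src, int length)
{
    unsigned char lead;

    assert(src != NULL);
    if (length < 1) return 0;
    lead = (unsigned char)src[0];
    return length >= (lead < 0xF0 ? demo_needed_below_f0(lead) : 1);
}

/* New-style check: lead bytes 0xF0..0xF5 need 4 bytes. */
static int demo_complete_new(const char *src, int length)
{
    unsigned char lead;

    assert(src != NULL);
    if (length < 1) return 0;
    lead = (unsigned char)src[0];
    if (lead >= 0xF0 && lead <= 0xF5) return length >= 4;
    if (lead > 0xF5) return 1;
    return length >= demo_needed_below_f0(lead);
}

/* The caller-visible name resolves to one variant at compile time. */
#if DEMO_UTF_MAX >= 4 || defined(DEMO_NO_DEPRECATED)
#   define Demo_UtfCharComplete demo_complete_new
#else
#   define Demo_UtfCharComplete demo_complete_old
#endif
```

Code that calls Demo_UtfCharComplete() needs no source changes; rebuilding with -DDEMO_UTF_MAX=4 or -DDEMO_NO_DEPRECATED selects the new variant.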

The documentation for Tcl_UtfCharComplete() is updated to state that this function can now be used to protect Tcl_UtfNext() calls too.
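The protection pattern can be sketched as a loop that only steps forward when the completeness check succeeds. The sketch is standalone and does not call Tcl: demo_complete and demo_next are hypothetical, 4-byte-aware stand-ins for Tcl_UtfCharComplete() and Tcl_UtfNext(), and continuation bytes are not validated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of the sequence that starts with `lead`. */
static int demo_len(unsigned char lead)
{
    if (lead < 0xC0) return 1;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    if (lead <= 0xF5) return 4;
    return 1;
}

/* Stand-in for the completeness check: are enough bytes available? */
static int demo_complete(const unsigned char *src, int length)
{
    return length >= 1 && length >= demo_len(src[0]);
}

/* Stand-in for the "next" step: skip one whole sequence. */
static const unsigned char *demo_next(const unsigned char *src)
{
    return src + demo_len(src[0]);
}

/*
 * Count the complete characters in buf[0..length). The loop stops at a
 * truncated trailing sequence, so demo_next() never steps past the buffer.
 * *consumed receives the number of bytes covered by complete characters.
 */
static int demo_count_chars(const unsigned char *buf, int length,
                            int *consumed)
{
    const unsigned char *p = buf;
    const unsigned char *end = buf + length;
    int count = 0;

    assert(buf != NULL && consumed != NULL && length >= 0);
    while (p < end && demo_complete(p, (int)(end - p))) {
        p = demo_next(p);
        count++;
    }
    *consumed = (int)(p - buf);
    return count;
}
```

If the buffer ends in the middle of a 4-byte sequence, the loop stops before that sequence and reports how many bytes were consumed, so a caller can wait for more input and resume from there.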

Implementation

The implementation is in branch tip-575.

Compatibility

As long as Tcl and the extension are both compiled with the same value, either -DTCL_UTF_MAX=3 (the default in Tcl 8.x) or -DTCL_UTF_MAX=4 (as in Tcl 9.x), nothing changes. A difference can only be observed in extensions compiled with a different TCL_UTF_MAX value than Tcl itself.

Copyright

This document has been placed in the public domain.