# TIP 575: Switchable Tcl\_UtfCharComplete()/Tcl\_UtfNext()/Tcl\_UtfPrev()
	Author:         Jan Nijtmans <[email protected]>
	State:          Final
	Type:           Project
	Vote:           Done
	Tcl-Version:    8.7
	Tcl-Branch:     tip-575
	Vote-Summary:	Accepted 3/0/0
	Votes-For:	JN, KW, SL
	Votes-Against:	none
	Votes-Present:	none
-----
# Abstract

This TIP is a successor to [TIP #542](542.md), resolving a corner case that was not recognized at the time.

[TIP #542](542.md) allows stub-enabled extensions to be compiled with `-DTCL_UTF_MAX=4` while
still working with a Tcl compiled with `-DTCL_UTF_MAX=3`. All functions that use a `Tcl_UniChar`
(e.g. `Tcl_UtfToUniChar()`) change behavior with the value of `TCL_UTF_MAX`. If `TCL_UTF_MAX=3`,
supplying a 4-byte UTF-8 character to `Tcl_UtfToUniChar()` returns 1 and stores a high
surrogate in `*chPtr`. If `TCL_UTF_MAX=4`, the function returns 4 and stores the full Unicode
character in `*chPtr`. This works by providing two different stub entries and selecting
between them based on the value of `TCL_UTF_MAX`.
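The selection mechanism can be illustrated with a minimal, self-contained sketch. It does not use `tcl.h`: every `sketch_`/`SKETCH_` name is hypothetical, `SKETCH_UTF_MAX` merely plays the role of `TCL_UTF_MAX`, and the decoder assumes a well-formed 4-byte sequence. It only shows the idea of two entry points with one name chosen at compile time, not how Tcl's stub table is actually laid out.

```c
#include <stddef.h>

/* Hypothetical stand-in for TCL_UTF_MAX; not a Tcl macro. */
#ifndef SKETCH_UTF_MAX
#define SKETCH_UTF_MAX 4
#endif

/* Decode a 4-byte UTF-8 sequence. Assumes src is well formed. */
static unsigned sketch_decode4(const unsigned char *src)
{
    return ((unsigned)(src[0] & 0x07) << 18)
         | ((unsigned)(src[1] & 0x3F) << 12)
         | ((unsigned)(src[2] & 0x3F) << 6)
         |  (unsigned)(src[3] & 0x3F);
}

/* Entry used when the caller was built for 3-byte mode:
 * reports 1 byte consumed and stores the high surrogate. */
static int sketch_to_unichar16(const unsigned char *src, unsigned *chPtr)
{
    unsigned cp = sketch_decode4(src);
    *chPtr = 0xD800 + ((cp - 0x10000) >> 10);
    return 1;
}

/* Entry used when the caller was built for 4-byte mode:
 * reports 4 bytes consumed and stores the full code point. */
static int sketch_to_unichar32(const unsigned char *src, unsigned *chPtr)
{
    *chPtr = sketch_decode4(src);
    return 4;
}

/* The "switch": one public name, two entries, chosen at compile time. */
#if SKETCH_UTF_MAX > 3
#define sketch_to_unichar sketch_to_unichar32
#else
#define sketch_to_unichar sketch_to_unichar16
#endif
```

For the bytes `F0 9F 98 80` (U+1F600), the 4-byte entry returns 4 and stores `0x1F600`, while the 3-byte entry returns 1 and stores the high surrogate `0xD83D`.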

The functions `Tcl_UtfNext()`/`Tcl_UtfPrev()` don't have a `Tcl_UniChar` parameter, but they
are still expected to be coupled with `Tcl_UtfToUniChar()`: if `TCL_UTF_MAX=4`, we would
expect `Tcl_UtfNext()` to be able to jump forward 4 bytes, while with `TCL_UTF_MAX=3`
`Tcl_UtfNext()` can jump forward at most 3 bytes. The same holds for `Tcl_UtfPrev()`.

The function `Tcl_UtfCharComplete()`, which is coupled with `Tcl_UtfToUniChar()`
(it indicates whether enough bytes are available for `Tcl_UtfToUniChar()` to be called),
has the same problem. Making this function switchable has the added advantage that it
can now be used to protect calls to `Tcl_UtfNext()` too, in extensions compiled with
any value of `TCL_UTF_MAX`.

# Specification

Implement new functions `Tcl_UtfCharComplete()`/`Tcl_UtfNext()`/`Tcl_UtfPrev()`, which can
handle 4-byte sequences: `Tcl_UtfNext()` and `Tcl_UtfPrev()` can jump 4 bytes forward and
back, respectively, so it is possible to jump over UTF-8 characters > U+FFFF
in one step instead of two.

The new `Tcl_UtfNext()`/`Tcl_UtfPrev()` functions will get their own new entries in the
stub table. So, extensions (however rare) using `Tcl_UtfNext()`/`Tcl_UtfPrev()` but
compiled against Tcl 8.6 headers will keep their original behavior.

The new `Tcl_UtfCharComplete()` will behave almost identically to the old one. The only
difference is when it encounters a starting byte between 0xF0 and 0xF5: it will then return
true only when at least 4 bytes are available.
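The rule above can be sketched as a small stand-alone function. This is an illustration only, not the Tcl implementation: the `sketch_` names are hypothetical, and the lengths used for lead bytes outside 0xF0..0xF5 are the usual UTF-8 lead-byte lengths, which is an assumption not taken from this TIP.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes a sequence needs, judged
 * from its first byte alone. */
static size_t sketch_bytes_needed(unsigned char lead)
{
    if (lead >= 0xF0 && lead <= 0xF5) {
        return 4;   /* the range named in this TIP */
    }
    if (lead >= 0xE0 && lead <= 0xEF) {
        return 3;   /* assumption: standard 3-byte lead */
    }
    if (lead >= 0xC0 && lead <= 0xDF) {
        return 2;   /* assumption: standard 2-byte lead */
    }
    return 1;       /* assumption: everything else counts as one byte */
}

/* Sketch of the new check: true only if `length` bytes starting at
 * `src` are enough to hold the whole sequence. */
static int sketch_char_complete(const char *src, size_t length)
{
    if (length == 0) {
        return 0;   /* nothing available, so nothing is complete */
    }
    return length >= sketch_bytes_needed((unsigned char)src[0]);
}
```

With this sketch, a buffer holding only `F0 9F 98` (3 bytes) is reported as incomplete, while `F0 9F 98 80` (4 bytes) is reported as complete.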

If an extension is compiled with `-DTCL_UTF_MAX=4` or with `-DTCL_NO_DEPRECATED`,
`Tcl_UtfCharComplete()` will behave as described in this TIP; otherwise it
will behave exactly as in Tcl 8.6.

The documentation for `Tcl_UtfCharComplete()` is updated to state that this function
can now be used to protect `Tcl_UtfNext()` calls too.
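The guarding pattern this refers to can be sketched without `tcl.h` as follows. All `sketch_` names are hypothetical: the length check stands in for `Tcl_UtfCharComplete()`, and the pointer advance stands in for `Tcl_UtfNext()`. The sketch assumes the 4-byte behavior described in this TIP and does not validate continuation bytes.

```c
#include <stddef.h>

/* Hypothetical helper: sequence length judged from the lead byte.
 * Lengths outside 0xF0..0xF5 are an assumption (standard UTF-8). */
static size_t sketch_seq_len(unsigned char lead)
{
    if (lead >= 0xF0 && lead <= 0xF5) return 4;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xC0 && lead <= 0xDF) return 2;
    return 1;
}

/* Count the complete characters in buf[0..len) and report how many
 * bytes they occupy. Stops before an incomplete trailing sequence,
 * so the caller can keep those bytes until more input arrives. */
static size_t sketch_count_chars(const char *buf, size_t len, size_t *consumed)
{
    size_t pos = 0;
    size_t count = 0;

    while (pos < len) {
        size_t need = sketch_seq_len((unsigned char)buf[pos]);

        if (len - pos < need) {
            break;      /* role of Tcl_UtfCharComplete() returning false */
        }
        pos += need;    /* role of Tcl_UtfNext(): one step per character */
        count++;
    }
    *consumed = pos;
    return count;
}
```

For a 7-byte buffer containing `A`, then `F0 9F 98 80`, then a truncated `F0 9F`, the sketch counts 2 characters and consumes 5 bytes, leaving the truncated sequence untouched.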

# Implementation

The implementation is in branch `tip-575`.

# Compatibility

As long as Tcl and its extensions are compiled with the same `TCL_UTF_MAX` value, either
`-DTCL_UTF_MAX=3` (the default in Tcl 8.x) or `-DTCL_UTF_MAX=4` (as in Tcl 9.x), nothing
changes. A difference can only be observed in extensions that are compiled with a different
`TCL_UTF_MAX` value than Tcl.

# Copyright

This document has been placed in the public domain.