TIP 542: New encodings: UTF-16, UCS-2, Support for switchable (Androwish-compatible) Full Unicode support.

Author:         Jan Nijtmans <[email protected]>
State:          Draft
Type:           Project
Vote:           Pending
Created:        10-May-2019
Post-History:   
Discussions-To: Tcl Core list
Keywords:       Tcl
Tcl-Version:    8.7
Tcl-Branch:     utf-max

Abstract

This TIP proposes to add more encodings, and to make it possible to switch Tcl between Full Unicode mode (TCL_UTF_MAX>3, almost compatible with Androwish) and the current partial Unicode mode (as far as TIP #389 goes, using TCL_UTF_MAX=3).

Rationale

Tcl can currently be compiled in three different modes: TCL_UTF_MAX=3, TCL_UTF_MAX=4 or TCL_UTF_MAX=6. The first two behave identically in Tcl 8.7 (since TIP #389). TCL_UTF_MAX=6 is overkill, since no UTF-8 character consists of more than 4 bytes.
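The 4-byte upper bound can be checked with a short sketch. The helper `utf8_length` below is hypothetical and not part of the Tcl C API; it follows standard UTF-8 rules, which may differ from how Tcl treats surrogates internally.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a Tcl function): returns the number of bytes
 * standard UTF-8 needs for one Unicode code point, or 0 if the value is
 * not a valid Unicode scalar value.
 */
size_t utf8_length(unsigned long cp)
{
    if (cp < 0x80) {
        return 1;               /* ASCII range */
    }
    if (cp < 0x800) {
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;               /* surrogates are not scalar values */
    }
    if (cp < 0x10000) {
        return 3;               /* rest of the Basic Multilingual Plane */
    }
    if (cp <= 0x10FFFF) {
        return 4;               /* highest code point still fits in 4 bytes */
    }
    return 0;                   /* beyond the Unicode range */
}
```

Under these rules, a buffer of 4 bytes holds any single encoded character, which is why a 6-byte buffer carries 2 bytes that are never used.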

Therefore it makes sense to reduce this to only two modes: TCL_UTF_MAX=3 means being fully compatible with Tcl 8.6, while TCL_UTF_MAX=4 means compatibility with the Androwish version of Tcl. Defining TCL_UTF_MAX=6 still results in a valid compilation (functioning the same as TCL_UTF_MAX=4), except that some buffer sizes will be 2 bytes larger than necessary.

Androwish chose to use a Tcl mode that was unsupported at the time: changing the size of the Tcl_UniChar type using TCL_UTF_MAX=6. This causes a binary incompatibility, with the result that all extensions need to be re-compiled with TCL_UTF_MAX=6 as well. This TIP proposes to add a supported TCL_UTF_MAX=4 compilation mode to Tcl, which has the same effect as the earlier unsupported TCL_UTF_MAX=6, but without the need to re-compile all extensions. The need for re-compilation is eliminated by putting the 32-bit versions of the Tcl_UniChar-related functions in different stub entries than the 16-bit versions. This way, 99% of all extensions compiled with TCL_UTF_MAX=3 keep functioning as before without re-compilation.
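The idea of separate stub entries can be illustrated with a minimal sketch. All names here (`DemoStubs`, `Demo_UniCharLen`, `DEMO_UTF_MAX`, and so on) are hypothetical stand-ins, not the actual contents of Tcl's stub table or the utf-max branch; the sketch only assumes that an existing slot keeps its 16-bit signature while the 32-bit variant gets a slot of its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the 16-bit and 32-bit Tcl_UniChar flavours. */
typedef uint16_t DemoUniChar16;
typedef uint32_t DemoUniChar32;

/* Hypothetical stub table: the 16-bit slot keeps its position and
 * signature, and the 32-bit function is added in a separate slot. */
typedef struct DemoStubs {
    size_t (*uniCharLen16)(const DemoUniChar16 *str); /* existing slot */
    size_t (*uniCharLen32)(const DemoUniChar32 *str); /* new slot */
} DemoStubs;

/* Count 16-bit units up to the terminating zero. */
size_t DemoLen16(const DemoUniChar16 *str)
{
    size_t n = 0;
    while (str[n] != 0) n++;
    return n;
}

/* Count 32-bit units up to the terminating zero. */
size_t DemoLen32(const DemoUniChar32 *str)
{
    size_t n = 0;
    while (str[n] != 0) n++;
    return n;
}

const DemoStubs demoStubs = { DemoLen16, DemoLen32 };

/* An extension selects its slot at compile time. A binary built with the
 * default setting keeps calling the unchanged 16-bit slot. */
#ifndef DEMO_UTF_MAX
#define DEMO_UTF_MAX 3
#endif
#if DEMO_UTF_MAX > 3
typedef DemoUniChar32 DemoUniChar;
#define Demo_UniCharLen(s) (demoStubs.uniCharLen32(s))
#else
typedef DemoUniChar16 DemoUniChar;
#define Demo_UniCharLen(s) (demoStubs.uniCharLen16(s))
#endif

/* U+1F600 as a surrogate pair (16-bit) and as one unit (32-bit). */
const DemoUniChar16 demoEmoji16[] = { 0xD83D, 0xDE00, 0 };
const DemoUniChar32 demoEmoji32[] = { 0x1F600, 0 };
```

Because the 16-bit slot is never redefined, an extension compiled against it continues to receive 16-bit semantics regardless of how the core itself was built. Only code that opts into the wider type reaches the new slot.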

The default compilation mode for Tcl will continue to be TCL_UTF_MAX=3, which is 100% upwards compatible with Tcl 8.6.

Specification

This document proposes:

If Tcl is compiled with either -DTCL_UTF_MAX=4 or -DTCL_NO_DEPRECATED, those functions will no longer be available to extensions; they will also be absent from Tcl 9.0.

Compatibility

As long as Tcl is compiled with -DTCL_UTF_MAX=3, this is fully upwards compatible.

When Tcl is compiled with -DTCL_UTF_MAX=4, it is compatible at the Tcl level with the Androwish version of Tcl, with one exception: in Androwish the "unicode" encoding is 32-bit, whereas in Tcl it continues to be 16-bit, an alias for "utf-16". At the C-API level, it is upwards compatible with Tcl 8.6 in TCL_UTF_MAX=6 mode, except for the functions marked above as deprecated. Those functions will be gone.
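A 16-bit encoding such as "utf-16" can still represent characters outside the Basic Multilingual Plane by using surrogate pairs. The following sketch shows the standard UTF-16 arithmetic; the helper names are hypothetical and are not part of the Tcl C API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of 16-bit units UTF-16 needs for a code
 * point (1 or 2), or 0 if the value is not a Unicode scalar value. */
size_t utf16_units(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;               /* lone surrogates are not encodable */
    }
    if (cp < 0x10000) {
        return 1;               /* BMP character: one unit */
    }
    if (cp <= 0x10FFFF) {
        return 2;               /* supplementary plane: surrogate pair */
    }
    return 0;
}

/* Hypothetical helper: high (first) surrogate for a code point
 * in the range U+10000..U+10FFFF. */
uint16_t utf16_high_surrogate(uint32_t cp)
{
    return (uint16_t)(0xD800 | ((cp - 0x10000) >> 10));
}

/* Hypothetical helper: low (second) surrogate for a code point
 * in the range U+10000..U+10FFFF. */
uint16_t utf16_low_surrogate(uint32_t cp)
{
    return (uint16_t)(0xDC00 | ((cp - 0x10000) & 0x3FF));
}
```

For U+1F600 this yields the pair 0xD83D, 0xDE00. A 32-bit encoding stores the same character as the single unit 0x0001F600 instead.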

Caveats

Reference Implementation

A reference implementation is available in the utf-max branch. https://core.tcl.tk/tcl/timeline?r=utf-max

Copyright

This document has been placed in the public domain.