Artifact [d74ea628a2]

Login

Artifact d74ea628a214bca1bb9aaaf9767ef607e01c75ce347479d6c6de9c30bc0dc00d:


# TIP 497: Full support for Unicode planes 1-16.
	Author:         Jan Nijtmans <[email protected]>
	State:          Final
	Type:           Project
	Vote:           Done
	Created:        23-Jan-2018
	Post-History:
	Discussions-To: Tcl Core list
	Keywords:       Tcl
	Tcl-Version:    9.0
	Tcl-Branch:     tip-497
	Vote-Summary:	Accepted 6/0/1
	Votes-For:		DKF, FV, JN, KBK, KW, SL
	Votes-Against:	none
	Votes-Present:	DGP
-----

# Abstract

This TIP proposes to add full support for Unicode planes 1-16, which
contain characters >= **U\+010000**, inclusive in the regexp
engine. Also, the caveats remaining from TIP #389 will be handled here.

# Rationale

In Tcl 8.7, running `[length "😊"]` gives the result 2. The reason for
this is that - internally - Tcl 8.x splits the character in two
surrogate characters, so they fit in a 16-bit character array.
This TIP (for Tcl 9.0) is meant to put an end to this workaround,
providing a real solution.

# Specification

This document proposes:

 * Change the internal "string" objType, such that it will store 32-bit
   characters in stead of 16-bit characters internally.

 * No longer consider Emoji to consist of 2 surrogate characters. So
   `[length "😊"]  == 1`

 * Modify the Tcl_UniChar type, such that - by default -
   `sizeof(Tcl_UniChar)` == 4 (was: 2).  All extensions assuming
   `sizeof(Tcl_UniChar)` == 2 will be affected by this.

It would be possible to introduce an additional 16-bit character objType
internally, this TIP doesn't implement or propose that. That could always
be added later, but at this moment it is questionable whether that would
give a serious advantage or not. Anyway, out of scope for this TIP.

# Compatibility

For Tcl 8.x, `sizeof(Tcl_UniChar)` == 2. All extensions making this
assumption will be affected, since for Tcl 9.0 that will not be true
any more. There are two ways extensions can be modified:

 * Either fix the places in the code where Tcl_UniChar is used in
   a non-portable way. The API introduced by TIP #548 can be used for that.

 * An alternative is compile the extension using `-DTCL_UTF_MAX=3`. This
   will restore `sizeof(Tcl_UniChar) == 2` as visible for the extension.
   Almost all (non-deprecated) Tcl API's can still be used by the extension,
   even the ones using `Tcl_UniChar`, except: `Tcl_NewUnicodeObj()`,
   `Tcl_SetUnicodeObj()`, `Tcl_GetUnicodeFromObj()`, `Tcl_GetUnicode()`.
   There is no need to re-compile Tcl using `-DTCL_UTF_MAX=3` for the
   extension to work.

This change introduces a potential binary incompatibility compared to Tcl 9.0a3
and earlier: All extensions using any of the 4 following functions need to be recompiled:

  * `Tcl_NewUnicodeObj()`

  * `Tcl_SetUnicodeObj()`

  * `Tcl_GetUnicodeFromObj()`

  * `Tcl_GetUnicode()`

None of the (battery-included) extensions included by Tcl (e.g. Thread, tdbc ...)
are affected. Extensions need to be aware that those 4 functions interface
directly with the internal "string" objType, which stores 16-bit characters
in Tcl 8.x and 32-bit characters in Tcl 9.x.

# Reference Implementation

A reference implementation is available in  the **tip-497** branch.
<https://core.tcl-lang.org/tcl/timeline?r=tip-497>

# Copyright

This document has been placed in the public domain.